
UNIT-1

Chapter-2
DATA WAREHOUSING

Data warehousing is the process of combining enterprise-wide shared data into a single
storage area from which end users can easily run queries, produce reports and perform analysis. Data
warehousing is a data management and analysis technology that adopts an update-driven approach.
Data warehouse systems are valuable tools in today's competitive and fast evolving world. The data
warehouse is a new approach to enterprise-wide computing at the architectural level. A data
warehouse can provide a central repository for large amounts of diverse and valuable information.
Data warehouse supports business analysis and decision making by creating an enterprise-wide
integrated database of summarized, historical information. It integrates data from multiple,
incompatible sources. By transforming data into meaningful information, a data warehouse allows
the business manager to perform more substantive, accurate and consistent analysis. Data
warehousing improves the productivity of corporate decision-makers through consolidation,
conversion, transformation and integration of operational data and provides a consistent view of the
enterprise.

Difference between Data warehouse and database:


 A data warehouse is a storage place where data gets stored so that applications can access
and share it easily. But a database does that already, so we can say that a data
warehouse is a database of a different kind.
 The main difference is that usual (or traditional) databases hold operational (most
often, transactional) data, and many decision-support applications would put too
much strain on those databases, interfering with day-to-day operations.
 A data warehouse contains summarized information. In general, our database is not a data
warehouse unless we also collect and summarize information from dissimilar sources,
use it as the place where these differences can be reconciled, and place the data into a
warehouse because we mean to allow several different applications to make use of the same
information.
 A data warehouse refers to a database that is maintained separately from an organization's
operational databases. An operational database is designed and tuned for known tasks and
workloads, such as indexing and hashing using primary keys,
searching for particular records and optimized queries.
 Data warehouse queries, on the other hand, are often very complex. They involve the computation
of large groups of data at the summarized level, and may require the use of special data
organizations and access and implementation methods based on multidimensional views. A
warehouse holds read-only data. These criteria bring the term data warehouse much closer
to our understanding of a warehouse as a place where we store many different things for the
sake of convenience.

DEFINITION:
A formal definition of a data warehouse by W H Inmon (in 1993)
“A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of
data in support of the management's decision-making process”.
 SUBJECT-ORIENTED:

 A data warehouse is organized around major subjects such as customer, products, sales, etc.
 Data are organized according to subject instead of application.
 For example, an insurance company using a data warehouse would organize their data by
customer, premium, and claim instead of by different products (auto, life, etc.).
 The data organized by subject contains only the information necessary for decision-
support processing.
 Information is presented according to specific subjects or areas of interest, and data is
manipulated to provide information about a particular subject.
 A data warehouse focuses on the modelling and analysis of data for decision makers, not on
daily operations or transaction processing.
 It provides a simple and concise view of particular subject issues by excluding data that
are not useful in the decision support process.
 In operational systems, data is stored by individual applications or business processes, for
example data about an individual order or customer.
 In a data warehouse, data is stored by real-world business objects or events, not by the
applications.

 NON-VOLATILE:

 A data warehouse is always a physically separate store of data, which is transformed
from the application data found in the operational environment.
 Due to this separation, data warehouses do not require transaction processing,
recovery, concurrency control, etc.
 The data are not updated or changed in any way once they enter the data warehouse,
but are only loaded, refreshed and accessed for queries.
 Once data is stored and committed, it is read-only and is never deleted, so that it remains
available for comparison with newer data.
 TIME-VARYING:

 Data are stored in a data warehouse to provide a historical perspective.


 Every key structure in the data warehouse contains, implicitly or explicitly, an element of
time.
 The data warehouse contains a place for storing data that are 5 to 10 years old, or older, to
be used for comparisons, trends and forecasting.
 Data changes are recorded and tracked so that change patterns can be determined
over time.

 INTEGRATED:

 A single source of information for and about understanding multiple areas of interest.
 The data warehouse provides one-stop shopping and contains information about a variety
of subjects.
 For example: the university data warehouse has information on faculty, students
and staff, instructional workload, student outcomes, etc.
 Data warehouse integrates the data so that inconsistencies are removed.
 A data warehouse is usually constructed by integrating multiple, heterogeneous sources
such as relational databases, flat files and on-line transaction (OLTP) records.
 When data resides in many separate applications in the operational environment, the
encoding of data is often inconsistent.
 In one application, gender may be coded as m/f, while in another it might be coded as 0 and 1.
 When data are moved from the operational environment into the data warehouse, they assume
a consistent coding convention, i.e. gender data is transformed into m and f (a small sketch follows).
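As an illustration, here is a minimal Python sketch of this kind of recoding during integration; the source records and the 0/1 mapping are hypothetical, not taken from the text.

```python
# Hypothetical sketch: unifying inconsistent gender codes from two source applications.
GENDER_MAP = {"m": "m", "f": "f", "M": "m", "F": "f", "0": "m", "1": "f"}  # 0/1 assignment assumed

def unify_gender(record: dict) -> dict:
    """Return a copy of the record with gender recoded to the warehouse convention (m/f)."""
    cleaned = dict(record)
    cleaned["gender"] = GENDER_MAP[str(record["gender"]).strip()]
    return cleaned

source_a = [{"cust_id": 1, "gender": "M"}, {"cust_id": 2, "gender": "f"}]   # codes M/f
source_b = [{"cust_id": 3, "gender": 0}, {"cust_id": 4, "gender": 1}]       # codes 0/1

warehouse_rows = [unify_gender(r) for r in source_a + source_b]
print(warehouse_rows)   # every row now uses the single convention 'm'/'f'
```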
DATA WAREHOUSING CONCEPTS:
The core concept of the data warehouse is the multidimensional data model.

 MULTIDIMENSIONAL DATA MODEL:

 At the core of the design of the data warehouse lies a multidimensional view of the data
model.
 The multidimensional data model stores data in the form of a data cube. Mostly, data
warehousing supports two- or three-dimensional cubes.
 A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with
respect to which an organization wants to keep records.
 The dimensional model is developed for implementing data warehouses and data marts.
 The MDDM provides both a mechanism to store data and a way to carry out business analysis.
 The multi dimensional data model is the integral part of OLAP because it provides answers
quickly.
 The multi dimensional data model is designed to solve complex queries in real time.
 The multi-dimensional data model is composed of logical cubes, measures, dimensions,
hierarchies, levels and attributes.
 For example in store sales record, dimensions allow the store to keep track of things like
monthly sales of items and the branches and locations.
 A multidimensional database helps to provide data-related answers to complex business
queries quickly and accurately.
 Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model.
 OLAP in data warehousing enables users to view data from different angles and dimensions.
 There are two components of the MDDM:
I. Dimensions: textual attributes used to analyse the data.
II. Facts: numeric values used to analyse the business.

Fig: 2-Dimensional View


Consider the data set represented in the following table as a 2-dimensional table; it shows
employment in California by gender, by year and by profession. We observe that the rows and the
columns must each represent more than one dimension if the data set contains more than 2 dimensions.
The rows in the table represent two dimensions, gender and year, which are ordered as gender first, then
year. The columns, however, do not really represent 2 distinct dimensions, but rather
some sort of hierarchy: the professional class and profession columns represent a hierarchical relationship
between the instances of professional class and the instances of the profession.
o DATA CUBE:

 When data is grouped or combined in multidimensional matrices, the result is called a data
cube.
 Changing from one dimensional hierarchy to another is accomplished in a data
cube by a technique called pivoting (also known as rotation).
 A popular conceptual model that influences data warehouse architecture is a
multidimensional view of data.
 Figure demonstrates a multidimensional view of the information corresponding to
above Table. This model views data in the form of a data cube (or, more precisely, a
hypercube). It has three dimensions, namely gender, profession and year. Each
dimension can be divided into sub dimensions.
Multidimensional representation of data:

In a multidimensional data model, there is a set of numeric measures that are the main theme or
subject of the analysis. In this example, the numeric measure is employment. We can have more
than one numeric measure. Examples of numeric measures are sales, budget, revenue, inventory,
population, etc. Each numeric measure depends on a set of dimensions, which provide the context
for the measure. All the dimensions together are assumed to uniquely determine the measure. Thus,
the multidimensional data model views a measure as a value placed in a cell in the multidimensional space.
Each dimension, in turn, is described by a set of attributes. In general, dimensions are the
perspectives or entities with respect to which an organization wants to keep records.

The formal definition: an n-dimensional data cube, C[A1, A2, ..., An], is a
database with n dimensions A1, A2, ..., An, each of which represents a theme and contains
|Ai| distinct elements in the dimension Ai. Each distinct element of Ai corresponds
to a data row of C. A data cell in the cube, C[a1, a2, ..., an], stores the numeric measure of the
data for Ai = ai, for all i. Thus, a data cell corresponds to an instantiation of all dimensions.
In the above example, C[gender, profession, year] is the data cube, and the data cell C[male, civil
engineer, 1992] stores 2780 as its associated measure. As |gender| = 2, |profession| = 6 and
|year| = 5, we have three dimensions with 2, 6 and 5 rows, respectively.
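A minimal Python sketch of this cell-addressing idea follows; only the single cell value quoted above (2780) is taken from the text, and the rest of the cube is left unfilled.

```python
# Sketch: an n-dimensional data cube as a mapping from one value per dimension to a measure.
cube = {
    ("male", "civil engineer", 1992): 2780,   # the data cell C[male, civil engineer, 1992]
    # ... the remaining |gender| * |profession| * |year| = 2 * 6 * 5 cells would go here
}

def cell(gender, profession, year):
    """Look up the numeric measure stored in data cell C[gender, profession, year]."""
    return cube.get((gender, profession, year))

print(cell("male", "civil engineer", 1992))   # 2780
```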
o DIMENSION MODELLING:
 The concept of a dimension provides a lot of semantic information, especially about the
hierarchical relationship between its elements.
 Dimension modelling is a special technique for structuring data around business
concepts.
 Unlike ER modelling, which describes entities and relationships, dimension modelling
structures the numeric measures and the dimensions.
 The dimension schema can represent the details of the dimensional modelling. The
following figures show the dimension modelling for our example.
Figure: Dimension modelling

o LATTICE OF CUBOIDS:

 The dimension hierarchy helps us view the multidimensional data in several different
data cube representations.
 Multidimensional data can be viewed as a lattice of cuboids.
 The bottom-most cuboid is the base cuboid; it consists of all the data cells.
 The top-most cuboid is the apex cuboid; it contains only one cell, with the numeric
measure aggregated over all n dimensions.
 In the lattice of cuboids, all other cuboids lie between the base cuboid and the apex cuboid.
 The cuboid C[A1, A2, ..., An] at the finest level of granularity is called the base cuboid and it
consists of all the data cells.
 The (n-1)-D cuboids are obtained by grouping the cells and computing the combined
numeric measure over a given dimension.
 Finally, the coarsest level consists of one cell with the numeric measure aggregated over all n
dimensions. This is called the apex cuboid.
 In the simplest case, the lattice of cuboids is a trivial one and contains just two cuboids: the
base cuboid and the apex cuboid.
 Consider an example a store called, Deccan Electronics may create a sales data
warehouse in order to keep records of the store's sales with respect to the time, product
and location. Thus, the dimensions are time, product and location. These dimensions
allow the store to keep track of things like monthly sales of items and the locations at
which the items were sold. The dimension table for product contains the attributes item
name, brand and type. The attributes shop, manager, city, region, state and country
describe the dimension location.

Fig: Data cube

 These attributes are related by a total order forming a hierarchy, such as shop < city < state <
country.
 This hierarchy is shown in the figure. An example of a partial order for the time dimension is the
attributes week, month, quarter and year.
 The sales data warehouse includes the sales amount in rupees and the total number of units
sold. Note that we can have more than one numeric measure.
 Figure shows the multidimensional model for such situations. The dimension hierarchies
considered for the data cube are time:(month <quarter < year); location: (city < province <
country); and product.
 Figure shows the cuboid C[quarter, city, product]. In the given dimensional hierarchies, the
base cuboid of the lattice is C[month, city, product] and the apex cuboid is
C[year, country, product]. Other intermediate cuboids in the lattice are C[quarter, province,
product], C[quarter, country, product], C[month, province, product], C[month, country,
product], C[year, city, product] and C[year, province, product] (a small sketch enumerating this
lattice follows).
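A small Python sketch enumerating this lattice, using the hierarchies quoted above (month < quarter < year; city < province < country; a single product level); the code itself is illustrative.

```python
# Sketch: every choice of one level per dimension is a cuboid in the lattice.
from itertools import product

hierarchies = {
    "time":     ["month", "quarter", "year"],         # finest to coarsest
    "location": ["city", "province", "country"],
    "product":  ["product"],                           # single level in this example
}

cuboids = list(product(*hierarchies.values()))

print(len(cuboids))    # 3 * 3 * 1 = 9 cuboids in total
print(cuboids[0])      # ('month', 'city', 'product')   -> the base cuboid
print(cuboids[-1])     # ('year', 'country', 'product') -> the apex cuboid
```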
o Dimension schema:

The multidimensional data model has the following conceptual basic components:
1. Summary measure: e.g., employment, sales, etc.
2. Summary function: e.g., sum
3. Dimension: e.g., gender, year, profession, state
4. Dimension hierarchy: e.g., professional class , profession.
SUMMARY MEASURES: A summary measure is essentially the main theme of the analysis in a
multidimensional model. A measure value is computed for a given cell by aggregating the data
corresponding to the respective dimension-value sets defining the cell. The measures can be
categorized into 3 groups based on the kind of aggregate function used.
 Distributive: A numeric measure is distributive if it can be computed in a distributed
manner as follows. Suppose the data is partitioned into a few subsets. The measure can be
simply the aggregation of the measures of all partitions. For example, count, sum, min and
max are distributive measures.
 Algebraic: An aggregate function is algebraic if it can be computed by an algebraic
function with some set of arguments, each of which may be obtained by a distributive
measure. For example, average is obtained by sum/count. Other examples of algebraic
functions are standard deviation and centre-of-mass.
 Holistic: An aggregate function is holistic if there is no constant bound on the storage
size needed to describe a sub aggregate. That is, there does not exist an algebraic function
that can be used to compute this function. Examples of such functions are median, mode,
most-frequent, etc. (A small sketch contrasting the three categories follows.)
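A small Python sketch contrasting the three categories; the data values and the split into two partitions are invented purely for illustration.

```python
# Sketch: distributive vs. algebraic vs. holistic aggregation over partitioned data.
from statistics import median

part1, part2 = [3, 7, 9], [1, 5]          # the data, split into two partitions

# Distributive: the global sum is simply the sum of the per-partition sums.
total = sum([sum(part1), sum(part2)])                         # 25

# Algebraic: average is derived from two distributive measures, sum and count.
avg = (sum(part1) + sum(part2)) / (len(part1) + len(part2))   # 5.0

# Holistic: the global median cannot be combined from per-partition medians
# (median(part1) = 7, median(part2) = 3, but the true median of all values is 5).
true_median = median(part1 + part2)                           # 5
print(total, avg, true_median)
```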

 OLAP OPERATIONS:

 Once we model our data warehouse in the form of a multidimensional data cube, it is
necessary to explore the different analytical tools with which to perform the complex
analysis of data.
 These data analysis tools are called OLAP (On-Line Analytical Processing) tools. OLAP is mainly
used to access live data online and to analyze it.
 OLAP tools are designed in order to achieve such analyses on very large databases. OLAP
provides a user-friendly environment for interactive data analysis.
 In the multidimensional model, the data are organized into multiple dimensions and each
dimension contains multiple levels of abstraction. Such an organization provides the users
with the flexibility to view data from different perspectives. There exist a number of OLAP
operations on data cubes which allow interactive querying and analysis of the data.
 OLAP operations exist to materialize different views of data, allowing interactive querying
and analysis of data.
 The basic operations of OLAP are:
i. Roll-up or Drill-up
ii. Roll-down or Drill-down
iii. Slicing and Dicing
iv. Pivot

 Roll-Up/Drill-Up/Consolidation Operation:

o Roll-up switches from a detailed to an aggregated level within the same classification
hierarchy.
o The roll-up operation helps in dimension reduction.
o Roll-up is performed by climbing up a concept hierarchy for a dimension such as location.
o Example: week → month → quarter

Consider the following cube illustrating temperature of certain days recorded weekly:

Fig_7: Example.

Assume we want to set up levels (hot (80-85), mild (70-75), cold (64-69)) in temperature from the
above cube. To do this we have to group columns and add up the values according to the concept
hierarchy. This operation is called roll-up. By doing this we obtain the following cube:

Fig_8: Rollup.

Here the roll-up operation groups the data by levels of temperature (cold, mild, hot) according to the
concept hierarchy defined on the temperature dimension; a roll-up on the time dimension would
similarly climb from day to week. (A minimal pandas sketch of this roll-up follows the figure.)

Roll-Up Operation
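A minimal roll-up sketch, assuming pandas is available; the daily readings and week assignments are hypothetical, and the temperature levels follow the ranges given above.

```python
# Sketch: roll up day-level readings to (week, temperature-level) cells.
import pandas as pd

detail = pd.DataFrame({
    "day":  ["day 1", "day 2", "day 3", "day 4"],
    "week": ["week 1", "week 1", "week 2", "week 2"],
    "temperature": [64, 72, 81, 85],
})

# Climb the concept hierarchy for temperature: raw values -> {cold, mild, hot} levels.
levels = pd.cut(detail["temperature"], bins=[63, 69, 75, 85], labels=["cold", "mild", "hot"])

# Roll-up: aggregate the detailed cells up the hierarchy (here, counting readings).
rolled_up = detail.assign(level=levels).groupby(["week", "level"], observed=True).size()
print(rolled_up)

# Drill-down is simply the inverse: return from `rolled_up` to the day-level table `detail`.
```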
 Roll-Down/Drill-Down Operation:

o Switching from an aggregated to a more detailed level within the same classification
hierarchy.
o It is performed in the following ways:
 By stepping down a concept hierarchy for a dimension.
 By introducing a new dimension.
o It navigates from less detailed data to more detailed data.
o Performing the drill-down operation on the same cube mentioned above:

Fig: Rolldown

The result of a drill-down operation performed on the central cube is obtained by stepping down a
concept hierarchy, here the time hierarchy day < week. Drill-down occurs by descending
the time hierarchy from the level of week to the more detailed level of day. New dimensions
can also be added to the cube, because drill-down adds more detail to the given data.

Roll-Down Operation
 Slicing Operation:

o The slice operation selects one particular dimension from a given cube and provides a new
sub-cube.
o This operation is used for reducing the data cube by one or more dimensions. The slice
operation performs a selection on one dimension of the given cube, resulting in a subcube.
o For example, in the cube example above, if we make the selection temperature = cool we
will obtain the following cube:

Fig 0: Slicing.

o Figure 1 shows a slice operation where the sales data are selected from the central cube for
the dimension time, using the criterion time = 'Q2' (a small sketch follows the figure).

Slice (time = 'Q2') C [quarter, city, product] = C [city, product]

Figure1: Slice operation
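A minimal slice sketch, assuming pandas; only the dimension names follow the example, the sales figures are hypothetical.

```python
# Sketch: slice time = 'Q2' fixes one value on one dimension, so that dimension drops out:
# C[quarter, city, product] -> C[city, product]
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q2", "Q3"],
    "city":    ["Mumbai", "Mumbai", "Pune", "Pune"],
    "product": ["TV", "TV", "Radio", "TV"],
    "amount":  [100, 150, 80, 120],
})

slice_q2 = sales[sales["quarter"] == "Q2"].drop(columns="quarter")
print(slice_q2)
```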


 Dicing Operation:

o The dice operation selects two or more dimensions from a given cube and provides a new
sub cube.
o This operation is also used for reducing the data cube by one or more dimensions.
o This operation is for selecting a smaller data cube and analyzing it from different
perspectives. The dice operation defines a subcube by performing a selection on two or more
dimensions.
o For example, applying the selection (time = day 3 OR time = day 4) AND (temperature =
cool OR temperature = hot) to the original cube we get the following subcube (still two-
dimensional):

Fig 11:Dice Operation

o Figure 2 shows a dice operation on the central cube based on the following selection criteria on
the time and location dimensions: (location = "Mumbai" or "Pune") and (time = "Q1" or
"Q2") (a small sketch follows the figure).

Dice (time = 'Q1' or 'Q2' and location = "Mumbai" or "Pune") C [quarter, city, product]

= C [quarter', city', product],
where quarter' and city' have truncated domains, namely {Q1, Q2} and {Mumbai, Pune},
respectively.

Figure 2: Dice operation
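A minimal dice sketch, again assuming pandas and hypothetical sales figures; the selection mirrors the criteria above.

```python
# Sketch: a dice keeps all three dimensions but truncates the domains of quarter and city.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q2"],
    "city":    ["Mumbai", "Pune", "Mumbai", "Nagpur"],
    "product": ["TV", "Radio", "TV", "TV"],
    "amount":  [100, 80, 120, 60],
})

dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["city"].isin(["Mumbai", "Pune"])]
print(dice)   # quarter' = {Q1, Q2}, city' = {Mumbai, Pune}, product unchanged
```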

 Pivot Operation:
o A visualization operation which rotates the data axes in order to provide an alternative
representation of the data.
o 3-D → 2-D
o Pivot groups data with different dimensions.
o Pivot (also called "rotate") rotates the data axes in view in order to provide an alternative
presentation of the same data, e.g. rotating the axes in a 3-D cube, or transforming a 3-D cube
into a series of 2-D planes.
o The cubes below show a 2-D representation of a pivot (a small sketch follows the figure).

Figure 3: Pivot operation
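A minimal pivot sketch, assuming pandas; the figures are hypothetical.

```python
# Sketch: the same data re-presented with city on the rows and quarter on the columns.
import pandas as pd

sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Mumbai", "Pune", "Mumbai", "Pune"],
    "amount":  [100, 90, 150, 80],
})

pivoted = sales.pivot_table(index="city", columns="quarter", values="amount", aggfunc="sum")
print(pivoted)
# quarter   Q1   Q2
# city
# Mumbai   100  150
# Pune      90   80
```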

WAREHOUSE SCHEMA:

o Schema is a logical description of the entire database.


o It includes the name and description of records of all record types including all
associated data-items and aggregates.
o Various data warehouse schemas are:
 Star Schema
 Snowflake Schema
 Fact Constellation Schema

 STAR SCHEMA:

 A star schema is a modelling paradigm in which the data warehouse contains a


large, single, central Fact Table and a set of smaller Dimension Tables, one for each
dimension.
 The Fact Table contains the detailed summary data. Its primary key has one key component per
dimension.
 Each dimension is a single, highly denormalized table. Every tuple in the Fact Table
consists of the fact or subject of interest, and the dimensions that provide that fact.
 Each tuple of the Fact Table consists of a (foreign) key pointing to each of the Dimension
Tables that provide its multidimensional coordinates.
 It also stores numerical values (non-dimensional attributes, and results of statistical
functions) for those coordinates.
 The Dimension Tables consist of columns that correspond to the attributes of the
dimension. So each tuple in the Fact Table corresponds to one and only one tuple in each
Dimension Table, whereas one tuple in a Dimension Table may correspond to more
than one tuple in the Fact Table. So we have a 1:N relationship between the Dimension Table
and the Fact Table.
 The advantages of a star schema are that it is easy to understand, easy to define
hierarchies, reduces the number of physical joins, and requires low maintenance and very
simple metadata.
 For example, let us consider the "Employment" data warehouse. We have three
Dimension Tables and one Fact Table. The star schema for this example is shown in
the figure below (a minimal DDL sketch follows the figure).

Figure: Star schema
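A minimal DDL sketch of such a star schema, assuming SQLite via Python's sqlite3 module; the table and column names are illustrative rather than taken from the figure.

```python
# Sketch: one central fact table with a foreign key per dimension plus the numeric measure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_gender     (gender_id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE dim_profession (profession_id INTEGER PRIMARY KEY, profession TEXT, prof_class TEXT);
    CREATE TABLE dim_year       (year_id INTEGER PRIMARY KEY, year INTEGER);

    -- Central fact table of the star: its key is composed of one key per dimension.
    CREATE TABLE fact_employment (
        gender_id     INTEGER REFERENCES dim_gender(gender_id),
        profession_id INTEGER REFERENCES dim_profession(profession_id),
        year_id       INTEGER REFERENCES dim_year(year_id),
        employment    INTEGER          -- the numeric measure
    );
""")
```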

 SNOWFLAKE SCHEMA:

 Star schema consists of a single fact table and a single denormalized dimension table for each
dimension of the multidimensional data model.
 To support attribute hierarchies, the dimension tables can be normalized to create snowflake
schemas.
 A snowflake schema consists of a single fact table and multiple dimension tables. Like the
Star Schema, each tuple of the fact table consists of a (foreign) key pointing to each of the
dimension tables that provide its multidimensional coordinates.
 It also stores numerical values (non-dimensional attributes, and results of statistical
functions) for those coordinates. Dimension Tables in a star schema are denormalized, while
those in a snowflake schema are normalized.
 The advantages of the snowflake schema are as follows:
 A normalized table is easier to maintain.
 Normalizing also saves storage space, since an un-normalized Dimension Table tends
to be large and may contain redundant information.
 However, the snowflake structure may reduce the effectiveness of navigating across the
tables, due to the larger number of join operations required.

Figure: Snowflake Schema

An example of a snowflake schema for a company Deccan Electronics is given in Figure. It can be
seen that the dimension table for the items is normalized resulting in two tables namely, the item and
supplier tables.

 FACT CONSTELLATION SCHEMA:

 A Fact Constellation is a kind of schema where we have more than one Fact Table
sharing among them some Dimension Tables. It is also called Galaxy Schema.
 For example, let us assume that Deccan Electronics would like to have another Fact Table for
supply and delivery.
Figure: Fact Constellation Schema

 Advantages of Data Warehouse Fact Constellation Schema


o Different fact tables are explicitly assigned to the dimensions.
o Provides a flexible schema for implementation.
 Disadvantages of Data Warehouse Fact Constellation Schema
o The schema is more complex because several fact tables and aggregations are involved.
o A fact constellation solution is hard to maintain and support.

 DATAWAREHOUSING ARCHITECTURE:
 The data warehousing structure can be visualized as a 3-tier architecture.
 Tier 1 is essentially the warehouse server, Tier 2 is the OLAP-engine for analytical
processing, and Tier 3 is a client containing reporting tools, visualization tools, data
mining tools, querying tools, etc.
 There is also the backend process which is concerned with extracting data from multiple
operational databases and from external sources; with cleaning, transforming and integrating
this data for loading into the data warehouse server; and of course, with periodically
refreshing the warehouse.
 Tier 1 contains the main data warehouse. It can follow one of three models, or some
combination of these: it can be a single enterprise warehouse, or it may contain several
departmental data marts. The third model is to have a virtual warehouse.
 Tier 2 follows three different ways of designing the OLAP engine, namely ROLAP, MOLAP
and extended SQL OLAP.
Below Figure shows a typical data warehousing architecture.

OR

DataWarehouse Architecture

Data Sources:
 All the data related to any business organization is stored in operational databases, external
files and flat files.
 These sources are application oriented.
 E.g: Complete data of organization such as training detail, customer detail, sales,
departments, transactions, employee detail etc.
 Data present here is in different formats and may not be well documented.
Bottom Tier: Data Warehouse server
o The data warehouse server fetches only the relevant information based on the data mining
request (mining knowledge from large amounts of data).
o E.g: Customer profile information provided by external consultants.
o Data is fed into the bottom tier by backend tools and utilities.
o Function performed by backend tools and utilities are:
 Data Extraction
 Data Cleaning
 Data Transformation
 Load
 Refresh
o Bottom tier contains:
 Data Warehouse
 Meta data repository
 Data Marts
 Monitoring and Administration

 Data Warehouse:
It is an optimized form of the operational database that contains only relevant information and provides
fast access to data.

 Meta data repository:


Meta data repository contains:
 Structure of data warehouse.
 Data names and definitions.
 Sources of extracted data.
 Algorithm used for data cleaning purpose.
 Sequence of transformation applied on data.

 Data Marts:
 Subset of data warehouse contain only small slices of data warehouse.
 E.g: Data pertaining to the single department.
 Two types of data marts: independent and dependent (described under the warehouse server models below).
 Monitoring & Administration:

Middle Tier: OLAP Server


o It presents to the users a multidimensional view of data from the data warehouse or data marts.
o Typically implemented using one of two models: ROLAP or MOLAP (described under the OLAP engine section below).

Top Tier: Front End Tools

o Query Tools: Point and click creation of SQL used in customer mailing list.
o Reporting Tools: Production reporting tools, report writers.
o Analysis Tools: Prepare charts based on analysis.
o Data mining Tools: Mining knowledge, discovering hidden pieces of information, new
correlations and useful patterns.

 WAREHOUSE SERVER:
o A data warehouse server is the physical storage used by a data warehouse system.
o Various processed data and other relevant information that comes from several applications
and sources are stored in a data warehouse server where it is organized for future business
analysis and user query purposes.
o There are three data warehouse models:
 Enterprise Warehouse
 Data Marts
 Virtual Data Warehouse
 ENTERPRISE WAREHOUSE:
 This model collects all the information about the subjects, spanning the entire
organization.
 It provides corporate wide data integration, usually from one or more operational
systems or external information providers.
 An enterprise data warehouse is a unified database that holds all the business
information of an organization and makes it accessible all across the company.

 DATA MARTS:
 Data Marts are partitions of the overall data warehouse. If we visualize the data
warehouse as covering every aspect of a company's business (sales, purchasing, payroll,
and so forth), then a data mart is a subset of that huge data warehouse built specifically
for a department.
 Data marts may contain some overlapping data. A store sales data mart, for example,
would also need some data from inventory and payroll.
 There are several ways to partition the data, such as by business function or geographic
region. There are many alternatives to design a data warehouse.
 One feasible option is to start with a set of data marts for each of the component
departments. One can have a stand-alone data mart or a dependent data mart.
 The current trend is to define the data warehouse as a conceptual environment.
 The industry is moving away from a single, physical data warehouse toward a set of
smaller, more manageable, databases called data marts.
 The physical data marts together serve as the conceptual data warehouse. These marts
must provide the easiest possible access to information required by its user community.
o Stand-alone (Independent) Data Mart:
 Independent data marts, in contrast, are standalone systems built by drawing
data directly from operational or external sources of data or both.
 In a bottom-up approach, data mart development is independent of the
enterprise data warehouse.
 This approach enables a department or work-group to implement a data mart
with minimal or no impact on the enterprise's operational database.

Independent Data Mart

o Dependent Data Mart:


 Dependent data marts draw data from a central data warehouse that has already
been created.
 In a top-down approach, data mart development depends on the enterprise
data warehouse.
 This approach is similar to the stand-alone data mart, except that management of
the data sources by the enterprise database is required.
 These data sources include operational databases and external sources of data.
Dependent Data Mart

Differences between Enterprise data warehouse & Data Marts:

Enterprise Data Warehouse:


o It is an integration of multiple subjects.
o It stores enterprise specific business information.
o Designed for top management (CEO, board of directors).

Data Marts:
o It defines a single subject.
o It stores department specific information.
o Designed for middle-management users.

 VIRTUAL DATAWAREHOUSE:
 When end-users access the “system of record” (the OLTP system) directly and generate
“summarized data” reports and thereby given the feel of a “data warehouse”, such a data
warehouse is known as a “Virtual data warehouse”.
 This model creates a virtual view of databases, allowing the creation of a "virtual warehouse"
as opposed to a physical warehouse.
 In a virtual warehouse, we have a logical description of all the databases and their structures,
and individuals who want to get information from those databases do not have to know
anything about them.
 This approach creates a single "virtual database" from all the data resources. The data
resources can be local or remote.
 In this type of a data warehouse, the data is not moved from the sources. Instead, the users are
given direct access to the data. The direct access to the data is sometimes through simple SQL
queries or view definitions.
 The virtual data warehouse scheme lets a client application access data distributed across
multiple data sources through a single SQL statement, a single interface.
 All data sources are accessed as though they are local users and their applications do not even
need to know the physical location of the data.
 A virtual database is easy and fast, but it is not without problems. Since the queries must
compete with the production data transactions, its performance can be considerably degraded.
Since there is no metadata, no summary data or history, all the queries must be repeated,
creating an additional burden on the system.

 METADATA:
 Meta data is a data that describes other data.
 Metadata summarizes basic information about data, which can make finding and working
with particular instances of data easier.
 Metadata can be created manually, or by automated information processing.
 Manual creation tends to be more accurate, allowing the user to input any information
they feel is relevant or needed to help describe the file.
 Automated metadata creation can be much more elementary, usually only displaying
information such as file size, file extension, when the file was created and who created
the file.
 Metadata serves to identify the contents and location of data in the warehouse.
 Metadata is a bridge between the data warehouse and the decision support application.
 In addition to providing a logical linkage between data and applications, metadata can
isolate access to information across the entire data warehouse, and can enable the
development of applications which automatically update themselves to reflect data
warehouse content changes.
 Metadata is needed to provide a definite interpretation. Metadata provides a catalogue of
data in the data warehouse and the pointers to this data.
 Metadata may also contain: data extraction/ transformation history, column aliases, data
warehouse table sizes, data communication/modelling algorithms and data usage
statistics.
 Metadata is also used to describe many aspects of the applications, including hierarchical
relationships, stored formulae, whether calculations have to be performed before or after
consolidation, currency conversion information, time series information, item description
and notes for reporting, security and access controls, data update status, formatting
information, data sources, availability of pre-calculated summary tables, and data storage
parameters.
 In the absence of this information, the actual data cannot be properly interpreted. In simple
terms, metadata is a "data about data" definition.
 Business metadata includes business terms and definitions, data ownership
information and charging policies.

 Types of Meta Data


o Build-Time Meta Data
o Usage Meta Data
o Control Meta Data
Build-Time Meta Data:
o Whenever we design and build a warehouse, the metadata that we generate is called
build-time meta data.
o This metadata links business and warehouse terminology and describes the data's technical
structure.
o It is the most detailed and exact type of meta data and is used extensively by warehouse
designers, developers and administrators.
o It is the primary source of the meta data used in the warehouse.

Usage Meta Data:


o Metadata which is derived from build-time metadata when the warehouse is in production
is called usage metadata.
o Usage metadata is an important tool for users and data administrators.

Control Meta Data:


o This metadata is used by the databases and other tools to manage their own operations.
o For example, a DBMS builds an internal representation of the database catalogue for use as a
working copy from the build-time catalogue.
o This representation functions as control metadata. Most control metadata is of interest only to
systems programmers.

 OLAP ENGINE:
o The main function of the OLAP engine is to present the user a multidimensional view of the
data warehouse and to provide tools for OLAP operations.
o If the warehouse server organizes the data warehouse in the form of multidimensional arrays,
then the implementational considerations of the OLAP engine are different from those when
the server keeps the warehouse in a relational form.
o There are three options of the OLAP engine:
 Specialized SQL Server:
 This model assumes that the warehouse organizes data in a relational structure and
the engine provides an SQL-like environment for OLAP tools.
 The main idea is to exploit the capabilities of SQL. We shall see that the standard
SQL is not suitable for OLAP operations.
 However, some researchers, (and some vendors) are attempting to extend the
abilities of SQL to provide OLAP operations.
 ROLAP( Relational OLAP):
 ROLAP works with data that resides in relational database where the base data
and dimension tables are stored as relational tables.
 This model permits multidimensional analysis of data as this enables users to
perform a function equivalent to that of the traditional OLAP slicing and dicing
features.
This is achieved through the use of SQL reporting tools to extract or 'query' data
directly from the data warehouse.
 Each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
SQL statement (a small sketch follows the ROLAP figure).
 The ROLAP approach begins with the premise that data does not need to be
stored multidimensionally to be viewed multidimensionally.
 Normally, the ROLAP engine formulates optimized SQL statements that it sends
to the RDBMS server.
 It then takes the data back from the server, reintegrates it, and performs further
analysis and computation before delivering the finished results to the user.
 Two situations in which ROLAP is the appropriate choice are:
 The data warehouse and the relational database are inseparable.
 Any change in the dimensional structure requires a physical reorganization of
the database, which is too time consuming; so a ROLAP tool is the only
appropriate choice.
 Advantages of ROLAP :
 Can handle large amounts of data: The data size limitation of ROLAP
technology is the limitation on data size of the underlying relational database.
 ROLAP itself places no limitation on data amount.
 Disadvantages of ROLAP :
 Performance can be slow: The query time can be long if the underlying data
size is large.
 Limited by SQL functionalities: It is difficult to perform complex calculations
using SQL

Figure:ROLAP
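A small sketch of the idea that each slice or dice action adds a predicate to the WHERE clause of the SQL sent to the relational server; the table name, columns and generation logic are hypothetical simplifications, not an actual ROLAP engine.

```python
# Sketch: turning a dice selection into an SQL statement with a WHERE clause.
def dice_to_sql(selections: dict) -> str:
    """Each slice/dice criterion becomes one predicate in the WHERE clause."""
    where = " AND ".join(
        f"{col} IN ({', '.join(repr(v) for v in vals)})" for col, vals in selections.items()
    )
    return ("SELECT quarter, city, product, SUM(amount) FROM sales "
            f"WHERE {where} GROUP BY quarter, city, product")

print(dice_to_sql({"quarter": ["Q1", "Q2"], "city": ["Mumbai", "Pune"]}))
# SELECT ... WHERE quarter IN ('Q1', 'Q2') AND city IN ('Mumbai', 'Pune') GROUP BY ...
```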

 MOLAP (Multi-Dimensional OLAP):
 The third option is to have a special-purpose multidimensional data model for the
data warehouse, with a Multidimensional OLAP (MOLAP) server for analysis.
MOLAP servers support multidimensional views of data through array-based data
warehouse servers.
 They map multidimensional views of a data cube to array structures.
 Note: The advantage of using a data cube is that it allows fast indexing to
precomputed summarized data. As with a multidimensional data store, storage
utilization is low, and MOLAP is recommended in such cases (an array-based sketch
follows the MOLAP advantages and disadvantages below).
 Advantages of MOLAP:
 Excellent performance: MOLAP cubes are built for fast data retrieval,
and are optimal for slicing and dicing operations.
 Can perform complex calculations: All calculations have been pre-
generated when the cube is created. Hence, complex calculations are easier
to generate and return the result quickly.

 Disadvantages of MOLAP:
 Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large
amount of data in the cube itself.
 Requires additional investment: Cube technologies are often proprietary
and may not already exist in the organization. Therefore, to adopt MOLAP
technology, additional investments in human and capital
resources are likely to be needed.
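A minimal sketch of array-based (MOLAP-style) storage, assuming NumPy; apart from the single cell value quoted earlier, the dimension members and data are hypothetical.

```python
# Sketch: the cube is a dense array and a cell is addressed by dimension positions.
import numpy as np

genders     = ["male", "female"]
professions = ["civil engineer", "doctor", "teacher"]
years       = [1990, 1991, 1992]

cube = np.zeros((len(genders), len(professions), len(years)), dtype=int)

# A measure is stored by array index rather than as a relational row.
cube[genders.index("male"), professions.index("civil engineer"), years.index(1992)] = 2780

# A roll-up over the gender dimension is a simple array aggregation.
by_profession_year = cube.sum(axis=0)
print(cube.shape, by_profession_year.shape)   # (2, 3, 3) (3, 3)
```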
ROLAP vs MOLAP:
The following arguments can be given in favour of MOLAP:
 Relational tables are unnatural for multidimensional data.
 Multidimensional arrays provide efficiency in storage and operations.
 There is a mismatch between multidimensional operations and SQL.
 For ROLAP to achieve efficiency, it has to perform outside current relational systems,
which is the same as what MOLAP does.
The following arguments can be given in favour of ROLAP:
 ROLAP integrates naturally with existing technology and standards.
 MOLAP does not support ad hoc queries effectively, because it is optimized for
multidimensional operations.
 Since data has to be downloaded into MOLAP systems, updating is difficult.
 The efficiency of ROLAP can be achieved by using techniques such as encoding and
compression.
 ROLAP can readily take advantage of parallel relational technology.

 DATA WAREHOUSE BACKEND PROCESS:


o Data warehouse systems use backend tools and utilities to populate and refresh their data.
o These tools and facilities include the following functions:
 Data extraction, which gathers data from multiple, heterogeneous and external sources;
 Data cleaning, which detects errors in the data and rectifies them when possible;
 Data transformation, which converts data from legacy or host format to warehouse format;
 Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions; and
 Refresh, which propagates the updates from the data sources to the warehouse.

DATA EXTRACTION:
 Data extraction is the process of extracting data for the warehouse from various sources. The
data may come from a variety of sources, such as production data, legacy data, internal
office systems, external systems, metadata.
DATA CLEANING:
 Data cleaning is essential in the construction of quality data warehouses.
 The data cleaning techniques include
1. Using transformation rules, e.g., translating attribute names like 'age' to 'DOB';
2. Using domain-specific knowledge;
3. Performing parsing and fuzzy matching, e.g., for multiple data sources, one can
designate a preferred source as a matching standard, and
4. Auditing, i.e., discovering facts that flag unusual patterns.
DATA TRANSFORMATION:
 The sources of data for data warehouse are usually heterogeneous. Data transformation is
concerned with transforming heterogeneous data to a uniform structure so that the data can
be combined and integrated.
 Convert data from legacy or host format to warehouse format.
 The transformation of data into a desired state includes functions such as data formatting,
splitting data, joining data, creating rows and columns, and using lookup tables or creating
combinations within the data (a small sketch follows).
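A small transformation sketch, assuming pandas; the legacy records and the lookup table are hypothetical, and the steps illustrate the formatting, splitting and lookup functions named above.

```python
# Sketch: a few typical transformation steps applied before loading.
import pandas as pd

legacy = pd.DataFrame({
    "cust_name": ["Rao, Anita", "Shah, Vikram"],
    "dept_code": ["S01", "H02"],
    "amount":    ["1,200", "950"],
})
dept_lookup = pd.DataFrame({"dept_code": ["S01", "H02"], "dept_name": ["Sales", "HR"]})

transformed = (
    legacy
    # Splitting data: one legacy column becomes two warehouse columns.
    .assign(last_name=legacy["cust_name"].str.split(", ").str[0],
            first_name=legacy["cust_name"].str.split(", ").str[1])
    # Data formatting: normalise the amount into a numeric type.
    .assign(amount=legacy["amount"].str.replace(",", "").astype(int))
    # Lookup table: resolve codes to descriptive names before loading.
    .merge(dept_lookup, on="dept_code")
    .drop(columns="cust_name")
)
print(transformed)
```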
LOADING:
 A loading system should also allow system administrators to monitor the status, cancel,
suspend, resume loading or change the loading rate, and restart loading after failures without
any loss of data integrity.
 There are different data loading strategies: Batch loading, Sequential loading, Incremental
loading.
REFRESH:
 When the source data is updated, we need to update the warehouse. This process is called the
refresh function.
 Refresh policy is set by the data administrator, based on user needs and data traffic.
