Data Warehousing & Data Mining: Unit-1


DATA WAREHOUSING & DATA MINING

Unit-1

SYLLABUS - DATA WAREHOUSING AND DATA MINING
(BE-IT 401E MDU - 2006-09)

L T P: 3 1 -
Class work: 50 marks
Exam: 100 marks
Total: 150 marks
Duration of exam: 3 hours

Unit-I: Data Warehousing: Definition, usage & trends, DBMS vs data warehouse, Data marts,
Metadata, Multidimensional data model, Data cubes, Schemas for Multidimensional Database:
Star, snowflake and fact constellations.

Unit-II: Data warehouse process & architecture, OLTP vs OLAP, ROLAP vs MOLAP, types of
OLAP Server, 3-tier data warehouse architecture, distributed and virtual data warehouse, data
warehouse manager.

Unit-III: Data warehouse implementation, computation of data cubes, modelling, OLAP data,
OLAP queries manager, data warehouse back end tools, complex aggregation at multiple
granularities, tuning and testing of data warehouse.

Unit-IV: Data mining definition and task, KDD versus data mining, tools and applications.

Unit-V: Data mining query languages, data specification, specifying knowledge, hierarchy
specification, pattern presentation and visualisation specification, data mining languages and
standardisation of data mining.

Unit-VI: Data mining techniques: Association rules, clustering techniques, decision tree,
knowledge discovery through neural networks & genetic algorithm, rough sets and fuzzy
techniques.

Unit-VII: Mining complex data objects, spatial databases, multimedia databases, time series
and sequence data, mining text databases and mining the World Wide Web.

UNIT-I:
DATA WAREHOUSING

Syllabus: Data Warehousing: Definition, usage & trends, DBMS vs data warehouse, Data marts,
Metadata, Multidimensional data model, Data cubes, Schemas for Multidimensional Database:
Star, snowflake and fact constellations.

WHAT IS A DATA WAREHOUSE?

A data warehouse is a repository of subjectively selected and suitable operational data, which
can successfully answer any ad hoc, complex, statistical or analytical queries.

The data warehouse is more than just data; it is also the processes involved in getting that data
from sources to table, and in getting the data from table to analysts.

It is situated at the centre of a decision support system (DSS) of an organization and contains
integrated historical data; both summarized and detailed information, common to the entire
organization.

Data warehousing is the process of constructing and using a data warehouse. The construction
of a data warehouse involves data cleaning and data integration, which are a prelude to data
mining operations. Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their versatile databases to make strategic decisions.
It supports information processing by providing a solid platform of consolidated historical data
for analysis.

[Figure: Data source, Data Warehouse, OLAP, Data mining, Results & report, Decision-making]

According to W. H. Inmon, a leading architect in the construction of data warehouse systems, “A
data warehouse is a i) subject-oriented (customer, products, sales, etc.), ii) non-volatile (a
physically separate, unchanging store of data accessed for queries), iii) time-variant (stored for 5,
20 or more years and used for comparison, trends and forecasting), iv) integrated (from multiple
heterogeneous sources such as relational databases, ASCII files and OLTP files) collection of
data in support of the management’s decision-making process.”

i) Subject oriented:

A data warehouse is organized around major subjects such as customer, products, sales, etc. Data
are organized according to subject instead of application. For example, an insurance company
using a data warehouse would organize its data by customer, premium, and claim instead of by
different products (auto, life, etc.). Data organized by subject contain only the information
necessary for decision support processing.
ii) Integrated:

A data warehouse is usually constructed by integrating multiple, heterogeneous sources such as
relational databases, flat files, and OLTP files. When data resides in many separate applications
in the operational environment, the recording of data is often inconsistent, so naming
conventions, encodings and units must be reconciled as the data are integrated.
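
As a minimal sketch of what integration involves, the Python fragment below maps gender codes that two hypothetical source applications record differently onto one warehouse convention. The source names (`billing`, `claims`), the codes and the `integrate` helper are illustrative assumptions, not part of any particular product.

```python
# Hypothetical source systems that record the same attribute differently.
GENDER_CODES = {
    "billing": {"m": "M", "f": "F"},
    "claims": {"1": "M", "0": "F"},
}

def integrate(source, record):
    """Return a copy of record with gender mapped to the warehouse convention."""
    mapping = GENDER_CODES[source]
    cleaned = dict(record)
    cleaned["gender"] = mapping[str(record["gender"]).lower()]
    cleaned["source"] = source  # remember where the row came from
    return cleaned

rows = [
    integrate("billing", {"customer": "C1", "gender": "m"}),
    integrate("claims", {"customer": "C2", "gender": 0}),
]
```

After integration both rows use the same `M`/`F` encoding, whichever application produced them.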

iii) Time Variant:

Data are stored in a data warehouse to provide a historical perspective. Every key structure in the
data warehouse contains, implicitly or explicitly, an element of time. The data warehouse
provides a place for storing data that are 5 to 10 years old, or older, to be used for comparison,
trends and forecasting.

iv) Non-volatile:

A data warehouse is always a physically separate store of data, which is transformed from the
application data found in the operational environment. Due to this separation, data warehouses
do not require transaction processing, recovery, concurrency control, etc. The data are not
updated or changed in any way once they enter the data warehouse, but are only loaded,
refreshed and accessed for queries.

Utility (examples): A data warehouse is no more than a collection of the key pieces of
information used to manage and direct the business for the most profitable outcome. This could
be anything from deciding the level of stock in a superstore warehouse, to building a database
of customers for various sales promotion schemes, to strategic decision-making on major
market segments and company profitability.

Data warehouses, along with OLAP tools, are increasingly being deployed to analyse historical
or time series data to identify past patterns or trends which may be useful in forecasting the
future. In addition, unknown associations and relationships between data items in data-intensive
areas can be discovered using various data mining techniques.
Data warehousing enables:
• Easy organization and maintenance of large data
• Fast retrieval and analysis
• In-depth information required from time to time

There is a dearth of value-adding information describing detailed solutions to the business
and technical issues involved in delivering a data warehouse.

Components of a data warehouse:

• Reference & transaction data (comes from operational or source data and is kept as
database data)
• Derived data (from transaction or reference data through OLTP)
• Denormalized data (basis for OLAP)

On the risks and issues associated with data warehousing, the main points are:

• Identifying the technical, architecture and infrastructure problems, with guidance on
solving them.
• Appropriate design for the database system managers, and for construction of end-user
queries.
• Constructing a realistic project plan, with minimum risk of failure.

Warehouse Management Tools

A data warehouse architecture usually provides a set of management tools, which include the
load manager, the warehouse manager and the query manager.

In addition, other management tools such as the server manager and the network manager must
support the data warehouse.

Data warehousing is very important for the following specialists:

• Technical architects
• Systems engineers and analyst programmers
• Database designers
• Database administrators
• System managers and operating managers

Data warehousing is oriented towards the following subjects in a corporate house:

i) Customer
ii) Product
iii) Transaction or activity
iv) Policy
v) Claim
vi) Account & finance

DATA WAREHOUSE USAGE

Data warehouses are used in a wide range of applications. Business executives in almost every
industry use the data collected, integrated, preprocessed and stored in data warehouses and data
marts to perform data analysis and make strategic decisions. In many firms data warehouses are
used as an integral part of a plan-execute-assess ‘closed-loop’ feedback system for enterprise
management. Data warehouses are used extensively in

• banking & financial services
• consumer goods and retail distribution sectors
• controlled manufacturing, such as demand-based production.

Data warehousing is a process that evolves gradually within enterprises. Typically, the longer a
data warehouse has been in use, the more it will have evolved. Initially, the data warehouse is
mainly used for generating reports and answering predefined queries. Eventually, it is used to
analyze summarized and detailed data, in multidimensional analysis and sophisticated slice and
dice operations.

Finally, the data warehouse may be employed for knowledge discovery and strategic
decision-making. The tools used for these purposes can be categorized into access and retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
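
The slice and dice operations mentioned above can be illustrated with a small Python sketch. It treats a handful of made-up sales facts as a three-dimensional cube (year, region, product) and shows a slice, a dice and a simple roll-up. The data values and the helper names `slice_cube`, `dice_cube` and `roll_up` are assumptions made for illustration only.

```python
from collections import defaultdict

# Each fact is (year, region, product, units sold); the values are made up.
facts = [
    (2023, "North", "Pen", 10),
    (2023, "South", "Pen", 5),
    (2024, "North", "Ink", 7),
    (2024, "North", "Pen", 3),
]
DIMS = ("year", "region", "product")

def slice_cube(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]

def dice_cube(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [r for r in rows
            if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())]

def roll_up(rows, dim):
    """Summarize units sold along one dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[DIMS.index(dim)]] += r[3]
    return dict(totals)

north = slice_cube(facts, "region", "North")
pens_2023 = dice_cube(facts, year={2023}, product={"Pen"})
by_year = roll_up(facts, "year")
```

A real OLAP server would precompute and index such aggregates; the sketch only shows what each operation selects.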

OPERATIONAL DATA & OPERATIONAL DATABASES

Operational Data: The data warehousing process begins with your operational data. It is this data
that you must analyze and evaluate, cleanse and translate, and ultimately use to populate your
warehouse. The term operational data is most often used to refer to data that is generated by
your online transaction processing (OLTP) system. As you can see in Figure 1, the definition is
expanded to include any data source that you use to maintain the day-to-day operations of your
business. Operational data can include

• Purchased data, such as mailing lists or demographic information that is provided to you
on a CD-ROM or made available on another company's extranet.
• Legacy data that has been collected and stored in relational or non-relational databases.
(Be careful about including legacy data in your data warehouse, as it may no longer be
relevant to the way you do business.)
• Data collected and managed by your Front Office Automation (FOA) systems. These
systems are used to manage and leverage relationships with your customers.
• Data collected and managed by your Enterprise Resource Planning (ERP) systems. This
section of data has become the most rapidly growing contributor to operational data.

Operational data is operated on day to day. Operational systems are the backbone of any
enterprise: order entry, inventory, manufacturing, payroll and accounting systems all produce
operational data. These data are the raw material for building operational databases, which help
in building a Management Information System and in answering ad hoc enquiries.

Operational databases help in building enterprise informational systems, which are used for
planning, forecasting and financial analysis. The knowledge-based functions are informational
systems. Informational systems have to do with analyzing operational data and making
decisions, often major decisions about how the enterprise will operate, now and in the future.
Operational database systems not only have a different focus from operational data, they
often have a different scope. Where operational data needs are normally focused upon a single
area, operational databases often span a number of different areas and need large amounts of
related operational data.

The data warehouse is a by-product of operational data and operational databases. In the last few
years, data warehousing has grown rapidly from a set of related ideas into an architecture for data
delivery for enterprise end-user computing.

DATA MARTS

A data mart is a subset of a data warehouse that deals with a single subject or department. It can
be viewed as a solution, because the data stored in the data mart is designed to answer a specific
business problem. Because this data has been scrubbed and organized to support a single
business function, users find it easy to traverse the data and to create and run analytical processes
using industry standards like ODBC.

• The flow of data from the data warehouse to the various departments for their customised
DSS usage produces the data marts. Each one is designed to answer a very specific
business problem of its department without referring to the data warehouse.
• Data marts can be classified into two groups: multidimensional (MDDB OLAP or MOLAP)
and relational OLAP (or ROLAP).
• In an MDDB data mart, the numeric data, which is basically multidimensional in nature,
can be sliced and diced freely (free from the modelling constraints of a DBMS such as an
RDBMS).
• On the other hand, ROLAP data marts may contain both text and numeric data and are
supported by an RDBMS.
• The various rules & processes for building a data warehouse are also applicable to data
marts.
• Independent data marts are developed from independent operational databases as well as
other sources of information or data storage.
• Like the enterprise warehouse, a data mart contains summarised data in a multidimensional
schema.
• A data mart can also reside on smaller departmental systems (Windows NT etc.).
• The implementation cycle is weekly or fortnightly.
• It is distributed enterprise-wide and used by all data miners.
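
To make the idea of a data mart as a departmental subset concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`warehouse_sales`, `marketing_mart`), the columns and the rows are hypothetical; a real data mart would usually live in its own database and be refreshed on a schedule.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE warehouse_sales (department TEXT, item TEXT, amount REAL)")
con.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("marketing", "banner", 120.0),
     ("finance", "audit", 300.0),
     ("marketing", "flyer", 80.0)],
)

# The dependent data mart is a departmental subset of the warehouse table.
con.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT item, amount FROM warehouse_sales WHERE department = 'marketing'"
)
mart_rows = con.execute(
    "SELECT item, amount FROM marketing_mart ORDER BY item").fetchall()
```

Queries from the marketing department can now run against `marketing_mart` without touching the warehouse table.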

META DATA

Metadata, in its broadest sense, is defined as "data about data" and describes the entire
application environment. Metadata is to the data warehouse what the card catalogue is to the
traditional library.

• It serves to identify the contents and location of the data in the warehouse.
• Metadata is a bridge between the data warehouse and the decision support application.
• In addition to providing a logical linkage between data and application, metadata can
pinpoint access to information across the entire data warehouse, and can enable the
development of applications which automatically update themselves to reflect data
warehouse content changes. Metadata is the lifeblood of your data warehouse.
• The metadata supports the understanding, navigation, and use of the data that is
accessible through your warehouse.
• Without metadata, the data resources of your organization will be underutilized, costing
you time and money.
• Metadata facilitates repeatability, ease of maintenance, and programmatic updates.
• It also helps answer questions about your information facts like:
    • What does this information fact mean?
    • When was it added to the warehouse?
    • Where is it stored?
    • Where did it originate?
    • How long is the information valid?
    • What does this field mean in business terms?
    • Which business process does this set of queries support?
    • When did the job to update the customer data in our data mart last run?
    • Which file contains the product data, where does it reside, and what is its detailed
    structure?
It is important that your metadata should be common to and shared by your decision support
applications and your operational data. Note that the metadata representation in this Figure has
an arrowhead on both ends. If your metadata is common and shared, it can reflect decisions made
from the analysis of your warehouse data.

Technically, a metadata repository should contain:

• A description of the structure of the data warehouse, e.g. warehouse schema, views,
dimensions, hierarchies and derived data definitions, data mart locations and contents, etc.
• Operational metadata, such as data lineage, currency of the data and monitoring
information (warehouse usage statistics, error reports, and audit trails).
• The summarization process, which includes dimension definitions, data on granularity,
partitions, summary measures, aggregation, summarization, etc.
• Details of data sources, which include source databases and their contents, gateway
descriptions, data partitions, data extraction, cleaning and transformation rules, and
defaults.
• Data related to system performance, which include indices and profiles that improve data
access and retrieval performance.
• Rules for timing and scheduling of refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
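
As a rough sketch of how a repository entry can answer the questions listed earlier (what a field means, where it originated, where it is stored, when it was last refreshed), the Python fragment below models one entry as a dataclass. The `MetadataEntry` fields, the `register` and `describe` helpers and the sample values are assumptions for illustration, not a standard metadata schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class MetadataEntry:
    name: str               # technical name of the warehouse element
    business_meaning: str   # what the field means in business terms
    source_system: str      # where the data originated
    location: str           # where it is stored in the warehouse
    last_refreshed: date    # when the load job last ran
    owner: str              # data ownership information

repository = {}

def register(entry):
    """Add or replace an entry in the in-memory repository."""
    repository[entry.name] = entry

def describe(name):
    """Answer the basic 'what, where from, where stored, when' questions."""
    e = repository[name]
    return (f"{e.name}: {e.business_meaning}; from {e.source_system}, "
            f"stored in {e.location}, refreshed {e.last_refreshed.isoformat()}")

register(MetadataEntry(
    name="customer_revenue",
    business_meaning="total invoiced amount per customer",
    source_system="order entry (hypothetical)",
    location="sales_mart.customer_summary",
    last_refreshed=date(2024, 1, 31),
    owner="finance",
))
summary = describe("customer_revenue")
```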

TYPES OF META DATA

Build-Time Metadata

Whenever we design and build a warehouse, the metadata that we generate can be termed
build-time metadata. This metadata links business and warehouse terminology and describes the
data's technical structure. It is the most detailed and exact type of metadata and is used
extensively by warehouse designers, developers, and administrators. It is the primary source of
most of the metadata used in the warehouse.

Usage Metadata

When the warehouse is in production, usage metadata, which is derived from build-time
metadata, is an important tool for users and data administrators. This metadata is used differently
from build-time metadata, and its structure must accommodate this fact.
META DATA IN THE DISTRIBUTED WAREHOUSE:

Metadata plays a very important role across the distributed corporate data warehouse. It is
through metadata that coordination of the structure of data is achieved across the many different
locations where the data warehouse is found. Not surprisingly metadata provides the vehicle for
the achievement of uniformity and consistency.

OPERATIONAL PROCESSING DATA WAREHOUSE

1 Technologically distributed data warehouse: The data warehouse environment will hold a
lot of data, and the volume of data will be distributed over multiple processors. Logically
there is a single data warehouse, but physically there are many data warehouses that are
all tightly related but reside on separate processors. This configuration can be called the
technologically distributed data warehouse.

2 Independently evolving distributed data warehouse: The data warehouse environment
grows up in an uncoordinated manner. First one data warehouse appears, then another,
and so on. The lack of coordination in the growth of the different data warehouses is
usually a result of political and organizational differences. This case can be called the
independently evolving distributed data warehouse.

EMERGING DATA WAREHOUSE TRENDS BEYOND 2000

The growth of the Web has reset expectations for end-users' access to corporate data. "Broad
new communities of end-users with diverse needs are now putting pressure on IT to deliver high-
performance access to corporate information over the Web," says Susan Andre, MD of local data
mart solutions provider Sagent SA.

Firstly, in the future customers will need to perform more sophisticated data analysis than they
currently do, Andre says. "This will result in the need to provide users with a more user-friendly
method to answer their questions which are typically of a statistical nature. Companies will need
to demystify statistical analysis and make it less intimidating."

Companies need to provide easy to use Web-enabled analysis applications to all decision-makers
throughout the enterprise - from sales to marketing, to finance and human resources. "They must
go beyond simple query capabilities and move to the next level of answering sophisticated
questions, be they statistical, data mining or modelling, in a Web-based point-and-click data
analysis environment."

Traditional statistical tools have been difficult to use and are not geared for business end-users
but more for expert users, Andre argues, "thereby distancing business managers from critical
decision support techniques such as the ability to do trend analysis". On the contrary, Andre
says, by providing data analysis tools through an easy-to-use graphical user interface, users can
analyse and interpret the results of their own business data.

For example, a sales manager will be able to use these tools to more accurately forecast sales
based on current and past sales figures as well as correlation analysis looking into the future.
Marketing managers will better relate the impact of marketing campaigns to company sales,
while e-commerce managers will use data analysis to identify shared characteristics of the
company's most profitable online customers.

"Information will now be readily available for managers to analyse and interpret quickly for
decision-making," Andre points out. "This compatibility will be a competitive requirement for e-
commerce managers. The decision-making process will improve due to the availability of
information.

"Secondly, in the future more organisations will build Web information applications operating in
conjunction with data warehouses/marts," Andre maintains. "A Web information system captures
and integrates business information stored in the data warehouse/marts, groupware systems and
Web servers. This information will be available via Web browsers."

As Web use increases, Andre also foresees significant growth in the use of tools enabling users
to subscribe to the information they need, and to have it delivered to them from a Web
information store at predefined intervals. However, working with multiple vendors and investing
limited IT resources to integrate disparate software products could negate the cost and time
benefits of bringing data warehouse/marts to the Web, Andre cautions.

Consequently, she says there is a need for single vendors and service organisations with tightly
integrated, high-performance solutions for deploying data warehouses to the Web.

TRENDS IN DATA MARTS

Although the above touches on some of the trends in individual components of the data
warehouse, the most pronounced trend in the last year does not relate to the data warehouse
itself, but to its child, the data mart.

"One of the major trends in the last 12 months is the emphasis on the data mart instead of data
warehouse," said Percy. "A lot of the vendors have picked up the idea of data marts, but lots of
these data marts can turn out to be dead ends."

Data warehouses by definition take the enterprise view, which involves IT leadership and a
potentially protracted planning process. Data marts deal with departmental data which is well
understood, familiar to those who deal with it every day and much more limited in scope.
Moreover, some vendors are offering shrink wrapped data mart solutions, which make it
relatively easy to implement one.

However, problems can arise when the enterprise begins to build a data warehouse, either
because IT has decided a data warehouse serves a strategic purpose or because the data mart has
grown. In either case, Gartner found users sometimes discover that assumptions used when
building the data mart no longer hold for the data warehouse and all earlier work has to be
scrapped.

Although data marts have a vital role to play, to be truly effective they must be done in the
context of an overall data warehouse strategy. Developing that strategy takes work and enterprise
buy-in. Unfortunately, this planning must be done quickly, because forward-looking departmental
managers will not wait forever before building their own data marts.

Mike Schiff, the executive director of Advanced Decision Support for Oracle Government, feels
data warehouses have moved from buzzword to something that's actually being done. At the
same time, he's also observed the data mart trend.

"Data marts are coming out as part of an overall strategy," said Schiff. "It's true that any agency
or department knows their own data, the guy in purchasing can build his own data mart. And
there's always the feeling that it's the other guy who is messing things up. To address this kind of
problem we need cross-functional committees. I believe we will see IS recognizing this, and that
with a bit of up front effort and by working with the business units, they can set up an
architecture into which data marts can fit. I see this as a win-win situation where IS will help the
department."

Schiff also sees opportunities in which some departments may profit from an isolated data mart
which does not require the enterprise-wide definitions. But this depends on delineation between
the independent and dependent data mart.

According to Pat Mansfield, president of Bull's U.S. Systems and Services Group, Bull selected
the data warehouse market as one of their key government markets two years ago because of
their expertise and because of pressures on government at all levels.

"People don't buy things, they buy solutions," Mansfield noted. "Government was going through
many changes in their structure. From an IT standpoint, they had an older infrastructure in place,
but were being asked to provide more service to customers for a lot less money. We viewed the
data warehouse as a key way to do a lot more with a lot less."

Mansfield said an overall data warehouse strategy is essential to the success of any data
warehouse/data mart initiative.

"Until you build a data warehouse with some size to it, you can't conceptualize what you can do
with it," Mansfield said. "If you have not done this before, you believe you can do a relatively
small thing. The reason you do that is you can't conceptualize where this thing is going to be in a
year from now. You only look at it as if you only need current information and last year's
information. Customers always tend to size a smaller system and then a year later recognize 'this
is never going to get me where I want to go.' The only way they can conceptualize the thing is to
build a model."

Toby Younis, director of Value Systems Engineering for Sybase, has seen the use of data
warehouses and data marts grow in response to several market forces.

"As businesses and government agencies distribute their businesses more and more, you get
adventurous managers out there who see opportunities," said Younis. "This process is
encouraged by the existence of readily available technology and the pressure to pursue smaller
market segments to gain and sustain a competitive advantage."

Given these circumstances, local managers may start to build their own data marts. That runs
counter to Younis' guideline for building them most effectively: "think globally and act locally."
His philosophy requires that IT management take the initiative to do the data warehouse design
and give direction to departmental managers.

Younis also cautions against taking a limited view of how to use a data warehouse. He said they
are often not as successful as they should be because they are designed to simply "run the
business." The real value of the data warehouse -- the thing that differentiates it from standard
decision support systems -- is using it as a tool to "grow the business," to find untapped markets
or existing markets which need new or different products.

"Government serves a small number of people with tangible products," Younis said. "It serves
many more with 'transparent products' such as roads or safety signs. We all benefit equally from
those products. Because of this, many people out there -- those who aren't served directly -- start
questioning the value of government. This is why government needs to start providing products,
services and capabilities to the vast majority. If the taxpayer doesn't see what they are getting for
their tax dollar, they want you to downsize."

The availability and growth in the Internet can help tackle this problem.

"The demographics of the Internet matches those who are most critical of the government,"
Younis pointed out. "Those are the same people who only visit government once a year -- to
pay their taxes. If I was in government and realized that my most critical constituency is on
the Internet, I'd go looking for them."

Younis believes that data warehousing, done correctly, facilitates that search and can help
provide services or products to those previously unreachable publics. Data marts, while tied
to an enterprise wide data warehouse strategy, should serve a similar role in the department,
according to Younis.

"The data mart should only exist because they have information that isn't available from the
data warehouse," said Younis. "I believe 95 percent of the data anybody needs has been
digitized someplace. A successful data mart will take 80 percent of its information from the
data warehouse and another 20 percent from sources other than its data warehouse. The 80
percent guarantees I stay within the business' or agency's core competency, and the 20
percent is where the opportunity lies."

Younis sees some government agencies beginning to use data warehousing technology to
"grow the business," but hopes to see more in the coming years.

The data warehouse market is still fluctuating. Although vendor-marketing messages have
sometimes been mixed, there does seem to be general agreement that both marts and
warehouses are important parts of a coordinated IT strategy. And while the quick ramp up
time for data marts can be appealing, they provide the greatest return on investment when
done under the umbrella of the coordinated, enterprise wide strategy needed to build a data
warehouse whether or not a data warehouse exists.

Moreover, it is important for data warehouse designers to look at data warehouses as a way
to identify and provide services and products to the vast majority who have little to no
contact with government beyond paying taxes and renewing their drivers’ licenses.

WAREHOUSE SCHEMA

Let us consider the “Employment” data warehouse. We have three dimension tables and one
fact table.

The star schema for this example is shown below.

FACT TABLE
    Sex-key, Time-key, Profession-key

DIMENSION (SEX) TABLE
    Sex-key, Profession

DIMENSION (TIME) TABLE
    Time-key, Year, Quarter, Month, Day

DIMENSION (PROFESSION) TABLE
    Profession-key, Profession-Class, Title, Level, Discipline
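
The star layout above can be sketched with Python's built-in `sqlite3` module: one fact table holding only keys and a measure, joined directly to each dimension table. The measure `employee_count`, the `sex` column and all rows are assumptions added for illustration, since the figure lists only the keys and attribute names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_sex (sex_key INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER,
                       quarter INTEGER, month INTEGER, day INTEGER);
CREATE TABLE dim_profession (profession_key INTEGER PRIMARY KEY,
                             profession_class TEXT, title TEXT,
                             level TEXT, discipline TEXT);
CREATE TABLE fact_employment (
    sex_key INTEGER REFERENCES dim_sex(sex_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    profession_key INTEGER REFERENCES dim_profession(profession_key),
    employee_count INTEGER
);
INSERT INTO dim_sex VALUES (1, 'F'), (2, 'M');
INSERT INTO dim_time VALUES (1, 2005, 1, 1, 15), (2, 2006, 1, 2, 10);
INSERT INTO dim_profession VALUES (1, 'Engineering', 'Analyst', 'Junior', 'IT');
INSERT INTO fact_employment VALUES (1, 1, 1, 40), (2, 1, 1, 60), (1, 2, 1, 55);
""")

# Every dimension is one join away from the fact table (the "star").
star_rows = con.execute("""
    SELECT t.year, s.sex, SUM(f.employee_count)
    FROM fact_employment f
    JOIN dim_sex s ON f.sex_key = s.sex_key
    JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY t.year, s.sex
    ORDER BY t.year, s.sex
""").fetchall()
```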

SNOWFLAKE SCHEMAS

Fact Table
    Time-key, Item-key, Location-key

Dimension (time) Table
    Time-key, Year, Quarter, Month, Day

Dimension (item) Table
    Item-key, Item-name, Brand, Type, Supplier-key

Dimension (supplier) Table
    Supplier-key, Supplier-name, Supplier-address, Supplier-type

Dimension (location) Table
    Location-key, Street, City-key

Dimension (city) Table
    City-key, City-name, State, Country, Pin-code
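
In a snowflake schema some dimensions are normalized into further tables, so a query may need two joins to reach an attribute. The sketch below, again using `sqlite3`, follows the tables listed above; the measure `units_sold` and all sample rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_city (city_key INTEGER PRIMARY KEY, city_name TEXT,
                       state TEXT, country TEXT, pin_code TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city_key INTEGER REFERENCES dim_city(city_key));
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY,
                           supplier_name TEXT, supplier_address TEXT,
                           supplier_type TEXT);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT,
                       supplier_key INTEGER REFERENCES dim_supplier(supplier_key));
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER,
                       quarter INTEGER, month INTEGER, day INTEGER);
CREATE TABLE fact_sales (time_key INTEGER, item_key INTEGER,
                         location_key INTEGER, units_sold INTEGER);
INSERT INTO dim_city VALUES (1, 'City A', 'State X', 'Country Y', '000001');
INSERT INTO dim_location VALUES (1, 'Main St', 1), (2, 'Side St', 1);
INSERT INTO dim_supplier VALUES (1, 'Supplier S', '1 Road', 'wholesale');
INSERT INTO dim_item VALUES (1, 'Pen', 'Brand P', 'stationery', 1);
INSERT INTO dim_time VALUES (1, 2006, 1, 1, 1);
INSERT INTO fact_sales VALUES (1, 1, 1, 10), (1, 1, 2, 5);
""")

# City and supplier are each two joins away from the fact table.
snowflake_rows = con.execute("""
    SELECT c.city_name, s.supplier_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_location l ON f.location_key = l.location_key
    JOIN dim_city c ON l.city_key = c.city_key
    JOIN dim_item i ON f.item_key = i.item_key
    JOIN dim_supplier s ON i.supplier_key = s.supplier_key
    GROUP BY c.city_name, s.supplier_name
""").fetchall()
```

Compared with the star layout, the normalized dimensions save space at the cost of extra joins per query.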

FACT CONSTELLATION

Fact 1: Supply
    Dimension 1-key, Dimension 2-key, Dimension 3-key, Summary (Units/cost)

Fact 2: Delivery
    Dimension 2-key, Dimension 3-key, Dimension 4-key, Agent

Dimension schemas
    Dimension 1 Schema, Dimension 2 Schema, Dimension 3 Schema, Dimension 4 Schema
    [Figure labels: Time, Item]
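
A fact constellation has several fact tables that share dimension tables. The following `sqlite3` sketch mirrors the figure: `fact_supply` and `fact_delivery` both reference dimensions 2 and 3. The table names, the `label` column and the sample rows are hypothetical, since the figure names the dimensions only generically.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim2 (dim2_key INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE dim3 (dim3_key INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE fact_supply (dim1_key INTEGER, dim2_key INTEGER,
                          dim3_key INTEGER, units INTEGER, cost REAL);
CREATE TABLE fact_delivery (dim2_key INTEGER, dim3_key INTEGER,
                            dim4_key INTEGER, agent TEXT);
INSERT INTO dim2 VALUES (1, 'A'), (2, 'B');
INSERT INTO dim3 VALUES (1, 'X');
INSERT INTO fact_supply VALUES (1, 1, 1, 100, 50.0), (1, 2, 1, 40, 20.0);
INSERT INTO fact_delivery VALUES (1, 1, 1, 'agent-1'), (1, 1, 2, 'agent-2');
""")

# The shared dimension lets one query combine measures from both fact tables.
constellation_rows = con.execute("""
    SELECT d.label,
           (SELECT COALESCE(SUM(s.units), 0) FROM fact_supply s
             WHERE s.dim2_key = d.dim2_key),
           (SELECT COUNT(*) FROM fact_delivery v
             WHERE v.dim2_key = d.dim2_key)
    FROM dim2 d
    ORDER BY d.label
""").fetchall()
```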

Data warehouse Architecture

[Figure: background process (cleaning, integration, etc.), DW server, metadata, data marts]


The warehouse server sits at the core of the architecture described above. We shall discuss
different models of the warehouse server. As mentioned earlier, there are three data warehouse
models.

The enterprise warehouse model collects all the information about the subjects, spanning the
entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers. An enterprise data warehouse requires a
traditional mainframe.

A deeper look at what metadata is, expanding the basic "data about data" definition, reveals the
following: metadata, in its broadest sense, defines and describes the entire application
environment. It answers such questions as

• What does this field mean in business terms?
• Which business process does this set of queries support?
• When did the job to update the customer data in our data mart last run?
• Which file contains the product data, where does it reside, and what is its detailed
structure?

Data Warehouse Backend Process

Data warehouse systems use backend tools and utilities to populate and refresh their data. These
tools and utilities include the following functions:
(1) Data extraction, which gathers data from multiple, heterogeneous, and external sources;
(2) Data cleaning, which detects errors in the data and rectifies them when possible;
(3) Data transformation, which converts data from legacy or host format to warehouse
format;
(4) Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions; and
(5) Refresh, which propagates the updates from the data sources to the warehouse.
The data may come from a variety of sources, such as

• Production data,
• Legacy data,
• Internal office systems,
• External systems
• Metadata.
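
The first four backend functions can be chained into a small pipeline. The Python sketch below is a toy version under stated assumptions: the sources are in-memory lists, cleaning only drops rows with a missing amount, and loading only consolidates totals per customer. The function names and sample rows are illustrative, not taken from any specific tool.

```python
def extract(sources):
    """Gather raw rows from several (here in-memory) sources."""
    return [dict(row, source=name)
            for name, rows in sources.items() for row in rows]

def clean(rows):
    """Drop rows with a missing amount; strip stray whitespace from names."""
    out = []
    for r in rows:
        if r.get("amount") is None:
            continue
        out.append(dict(r, customer=r["customer"].strip()))
    return out

def transform(rows):
    """Convert to the warehouse format: upper-case customer, float amount."""
    return [{"customer": r["customer"].upper(),
             "amount": float(r["amount"]),
             "source": r["source"]} for r in rows]

def load(rows):
    """Consolidate amounts per customer, standing in for summary tables."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

sources = {
    "production": [{"customer": " alice ", "amount": "10.5"}],
    "legacy": [{"customer": "ALICE", "amount": 4},
               {"customer": "bob", "amount": None}],
}
warehouse = load(transform(clean(extract(sources))))
```

Keeping each stage as a separate function makes it easy to test or replace one step without touching the others.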

Data Cleaning

Since data warehouses are used for decision making, it is essential that the data in the
warehouse be correct. However, since large volumes of data from heterogeneous sources are
involved, there is a high probability of errors in the data. Therefore, data cleaning is essential
in the construction of quality data warehouses. Data cleaning techniques include:

• Using transformation rules, e.g., translating an attribute name like ‘age’ to ‘DOB’;
• Using domain-specific knowledge;
• Performing parsing and fuzzy matching, e.g., for multiple data sources one can designate
a preferred source as a matching standard; and
• Auditing, i.e., discovering facts that flag unusual patterns.
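
Three of the techniques above (transformation rules, fuzzy matching against a preferred source, and auditing) can be illustrated with a minimal Python sketch. The rule table, the company names and the age range are hypothetical values chosen for the example; the fuzzy matching uses the standard-library difflib module, which is one possible choice and not the only one.

```python
import difflib

# Hypothetical transformation rules: source attribute name -> warehouse attribute name.
RENAME_RULES = {"cust_nm": "customer_name", "dob": "date_of_birth"}

# Spellings taken from the source designated as the matching standard (assumed data).
PREFERRED_NAMES = ["Acme Corporation", "Globex Limited", "Initech"]


def apply_rename_rules(record):
    """Rename attributes according to the rule table; unknown names are kept."""
    return {RENAME_RULES.get(key, key): value for key, value in record.items()}


def match_to_standard(name, cutoff=0.6):
    """Return the closest preferred spelling, or the name unchanged if none is close."""
    matches = difflib.get_close_matches(name, PREFERRED_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else name


def audit(records, field_name, low, high):
    """Flag records whose value lies outside the expected range."""
    return [r for r in records if not (low <= r[field_name] <= high)]


renamed = apply_rename_rules({"cust_nm": "Acme Corporaton", "dob": "1980-01-01", "id": 7})
renamed["customer_name"] = match_to_standard(renamed["customer_name"])
suspicious = audit([{"age": 34}, {"age": 212}], "age", 0, 120)
print(renamed, suspicious)
```

The cutoff value controls how aggressive the matching is; a lower cutoff corrects more misspellings but also risks merging names that are genuinely different.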

Data Transformation

The sources of data for a data warehouse are usually heterogeneous. Data transformation is
concerned with transforming heterogeneous data to a uniform structure so that the data can be
combined and integrated.

Loading

Since a data warehouse integrates time-varying data from multiple sources, the volumes
of data to be loaded into a data warehouse can be huge. Moreover, the time interval during which
the warehouse can be taken off-line is usually insufficient, and when loading data, indices and
summary tables need to be rebuilt. Loading techniques include:

• Batch loading
• Sequential loading
• Incremental loading
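
The notes do not define these loading techniques further. As one possible reading, the sketch below contrasts a batch load, which rebuilds the table from all source rows, with an incremental load, which applies only rows changed since the previous load. The row layout (id, updated_at) and the integer timestamps are assumptions made for the illustration.

```python
def batch_load(source_rows):
    """Rebuild the warehouse table from scratch using every source row."""
    return {row["id"]: row for row in source_rows}


def incremental_load(table, source_rows, last_loaded_at):
    """Apply only rows changed after the previous load; return the new watermark."""
    newest = last_loaded_at
    for row in source_rows:
        if row["updated_at"] > last_loaded_at:
            table[row["id"]] = row
            newest = max(newest, row["updated_at"])
    return newest


# Made-up source rows; updated_at is a simple integer clock.
source = [
    {"id": 1, "value": "a", "updated_at": 10},
    {"id": 2, "value": "b", "updated_at": 12},
]
table = batch_load(source)
watermark = 12

# Later, row 2 changes and row 3 is added in the source.
source[1] = {"id": 2, "value": "b2", "updated_at": 20}
source.append({"id": 3, "value": "c", "updated_at": 21})
watermark = incremental_load(table, source, watermark)
print(sorted(table), watermark)
```

The incremental variant touches only two rows here, which is why it suits a short off-line window; the price is that the source must record when each row was last changed.
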
Refresh

When the source data is updated, we need to update the warehouse. This process is called the
refresh function. Determining how frequently to refresh is an important issue. One extreme is to
refresh on every update. This is very expensive, however, and is normally only necessary when
OLAP queries need the most current data, as in an active data warehouse, for example for an
up-to-the-minute stock quotation. A more realistic choice is to perform refresh periodically.
Refresh policies should be set by the data administrator, based on user needs and data traffic.
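
The cost difference between refreshing on every update and refreshing periodically can be seen in a small simulation. The RefreshPolicy class and the update times below are hypothetical and serve only to illustrate the trade-off.

```python
class RefreshPolicy:
    """Hypothetical policy: refresh when updates are pending and the interval has elapsed."""

    def __init__(self, interval_seconds):
        self.interval_seconds = interval_seconds  # 0 means refresh on every update
        self.last_refresh = None

    def should_refresh(self, now, pending_updates):
        if pending_updates == 0:
            return False
        if self.last_refresh is None:
            return True
        return now - self.last_refresh >= self.interval_seconds

    def mark_refreshed(self, now):
        self.last_refresh = now


def count_refreshes(policy, update_times):
    """Count how many refreshes a policy performs for a stream of source updates."""
    count = 0
    pending = 0
    for now in update_times:
        pending += 1
        if policy.should_refresh(now, pending):
            policy.mark_refreshed(now)
            pending = 0
            count += 1
    return count


update_times = range(10)  # one source update per second for ten seconds
every_update = count_refreshes(RefreshPolicy(0), update_times)
periodic = count_refreshes(RefreshPolicy(5), update_times)
print(every_update, periodic)
```

With these assumed numbers the per-update policy refreshes ten times and the five-second policy twice, at the price of warehouse data that can be up to five seconds stale.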

The misunderstood OLAP Engine

There are three fundamental misconceptions about OLAP engines.

• Misconception: OLAP servers can perform data warehousing functions.
Fact: No. OLAP engines build relational cubes that provide the ability to perform
multidimensional analysis on a given data set. They are completely inadequate for many
tasks commonly associated with a data warehouse, such as historical archiving.
• Misconception: OLAP engines can cleanse and manipulate data being loaded.
Fact: No. OLAP servers focus on providing multidimensional analysis. Most available products
emphasize the OLAP functionality and leave the data preparation to the user.
• Misconception: OLAP engines store the data in a format open to other tools.

DATA MARTS

From a data warehouse, data flows to various departments for their customized decision support
system (DSS) usage. These individual departmental components are called data marts. In other
words, a data mart is a body of DSS data for a department that has the data warehouse as its
architectural foundation. A data mart is a subset of a data warehouse and is much more popular
than a data warehouse.

Some vital issues related to Data Marts.

It is important to note that all the issues related to data marts are equally relevant for a data
warehouse, since a data warehouse can be viewed as a collection of data marts.

1. Types of data marts

The data marts can be classified into two groups:

i. Multidimensional (MDDB OLAP or MOLAP): In a multidimensional data mart, the numeric data,
which is basically multidimensional in its original nature, can be sliced and diced in a free
fashion, i.e. free from the data modelling constraints of a DBMS (for example, data on prices of
commodities is multidimensional: year-wise, region-wise, group-item-wise). Query processing
and analysis will be much more powerful than simple table processing.

ii. Relational OLAP (ROLAP): The data mart may contain both numeric and text data supported by
an RDBMS. It is used for general-purpose DSS analysis with many indices, which support a star
schema.
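
The commodity-price example can be made concrete with a tiny cube addressed by year, region and item. The following Python sketch is only an illustration: the prices are made-up sample values, and the function names slice_cube, dice_cube and roll_up are chosen for the example rather than taken from any OLAP product.

```python
# Each cell of the cube is addressed by (year, region, item) -> price.
DIMENSIONS = ("year", "region", "item")
cube = {
    (2007, "North", "Rice"): 20.0,
    (2007, "South", "Rice"): 22.0,
    (2008, "North", "Rice"): 24.0,
    (2008, "North", "Wheat"): 15.0,
    (2008, "South", "Wheat"): 16.0,
}


def slice_cube(cube, dimension, value):
    """Slice: fix one dimension to a single value."""
    axis = DIMENSIONS.index(dimension)
    return {key: price for key, price in cube.items() if key[axis] == value}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall inside the allowed sets."""
    def keep(key):
        return all(key[DIMENSIONS.index(dim)] in values
                   for dim, values in allowed.items())
    return {key: price for key, price in cube.items() if keep(key)}


def roll_up(cube, dimension):
    """Average price for each value of one dimension."""
    axis = DIMENSIONS.index(dimension)
    groups = {}
    for key, price in cube.items():
        groups.setdefault(key[axis], []).append(price)
    return {value: sum(prices) / len(prices) for value, prices in groups.items()}


prices_2008 = slice_cube(cube, "year", 2008)
north_2008 = dice_cube(cube, year={2008}, region={"North"})
average_by_item = roll_up(cube, "item")
print(len(prices_2008), north_2008, average_by_item)
```

Any dimension can be fixed, restricted or aggregated with the same three functions, which is the sense in which the analysis is free from a fixed table layout.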

2. Loading a Data Mart

The Data Mart is loaded with data from a data warehouse by means of a load program. The
chief considerations for a load program are:

i. Frequency and schedule
ii. Total or partial refreshment
iii. Customization of data from the warehouse (re-sequencing and merging of data;
aggregations of data; summarization; efficiency; integrity of data)
iv. Data relationships and integrity of data domains, producing metadata for describing
the loading process.
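
Two of these considerations, customization by summarization and the choice between total and partial refreshment, can be illustrated with a minimal load program. The sketch below is an assumption-laden example in Python: the row layout (department, month, amount) and the function names are hypothetical.

```python
def summarize_for_mart(warehouse_rows, department):
    """Aggregate granular warehouse rows into monthly totals for one department."""
    totals = {}
    for row in warehouse_rows:
        if row["department"] == department:
            totals[row["month"]] = totals.get(row["month"], 0) + row["amount"]
    return totals


def load_mart(mart, warehouse_rows, department, months=None):
    """Total refreshment when months is None; otherwise refresh only the given months."""
    summary = summarize_for_mart(warehouse_rows, department)
    if months is None:
        mart.clear()
        mart.update(summary)
    else:
        for month in months:
            if month in summary:
                mart[month] = summary[month]
            else:
                mart.pop(month, None)
    return mart


# Made-up granular rows held in the warehouse.
warehouse_rows = [
    {"department": "sales", "month": "2009-01", "amount": 100},
    {"department": "sales", "month": "2009-01", "amount": 50},
    {"department": "sales", "month": "2009-02", "amount": 70},
    {"department": "hr", "month": "2009-01", "amount": 999},
]
sales_mart = load_mart({}, warehouse_rows, "sales")  # total refreshment

warehouse_rows.append({"department": "sales", "month": "2009-02", "amount": 30})
load_mart(sales_mart, warehouse_rows, "sales", months=["2009-02"])  # partial refreshment
print(sales_mart)
```

The partial refreshment recomputes only the month that changed, which matters when the mart is large and the load window is short.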

Metadata is to the data warehouse what the card catalogue is to the traditional library. It serves to
identify the contents and location of the data in the warehouse. Metadata is a bridge between the
data warehouse and the decision support application. In addition to providing a logical linkage
between data and application, metadata can pinpoint access to information across the entire data
warehouse, and can enable the development of

Metadata for a data mart

Metadata describes the details about the data in a data warehouse or in a data mart. Such a
description may be in terms of the contents and sources of the data that flows into the data
warehouse or data mart.

A data mart is a powerful and natural extension of a data warehouse to a specific functional or
departmental usage. The data warehouse provides granular data, and the various data marts
interpret and structure the granular data to suit their needs.

Following are the components of metadata for a given data warehouse or data mart:

i. Description of the sources of the data.
ii. Description of any customization that may have taken place as the data passes from the data
warehouse into the data mart.
iii. Descriptive information about the data mart, its attributes and relationships, etc.
iv. Definitions of all types.

Data model for a data mart

A formal data model needs to be built for a large data mart, which may also have some
processing involved. This processing can be repetitive or predictable.

No data model is necessary for ordinary or simple small data marts with no processing. Where a
data model is built, it should be compatible with the DBMS that handles the data mart. Data marts
that do not have to comply with a particular DBMS can be modelled in terms of a formal data model
that takes care of both summary and detailed levels of data.

Maintenance of a data mart: This means loading, refreshing and purging the data.
Refreshing the data is performed in regular cycles, according to the nature and frequency of data
updates.

Nature of data in data mart: The data in a data mart can be detailed-level, summary-level, ad
hoc, pre-processed or prepared data. As a rule, the bulk of a largely populated data mart
contains a lot of ad hoc summary data and also a lot of prepared detailed data.

Software components for a data mart: The software that can be found with a data mart
includes a DBMS, access and analysis software, software for automatic creation of the data mart,
purging and archival software, metadata management software, etc.

Tables in the data mart: A data mart may include detailed tables, summary tables, reference
tables (backlog stored as a library), historical tables, analytical (spreadsheet) tables, etc.

Large volumes of data are kept in tables in the form of star joins on normalised tables. When the
data is not large, relational tables are adequate.

Other aspects of data mart

• External data (if required, better stored in the data warehouse first)
• Reference data (past or historical data)
• Performance issues
• Monitoring requirements for a data mart
• Security in a data mart

__________________________________________________
A metadata repository should contain:

• A description of the structure of the data warehouse. This includes the warehouse
schema, views, dimensions, hierarchies and derived data definitions, data mart locations and
contents, etc.
• Operational metadata, such as data lineage, currency of the data and monitoring
information (warehouse usage statistics, error reports, and audit trails).
• The summarization process, which includes dimension definitions, data on granularity,
partitions, summary measures, aggregation, summarization, etc.
• Details of data sources, which include source databases and their contents, gateway
descriptions, data partitions, data extraction, cleaning and transformation rules, and
defaults.
• Data related to system performance, which includes indices and profiles that improve data
access and retrieval performance, in addition to rules for timing and scheduling of
refresh, update, and replication cycles; and
• Business metadata, which includes business terms and definitions, data ownership
information, and charging policies.
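
A very small repository holding business and operational metadata might look like the following Python sketch. The class names, field names and sample entries are hypothetical; a real repository would cover all the categories listed above.

```python
from dataclasses import dataclass, field


@dataclass
class FieldMetadata:
    business_meaning: str      # business metadata
    source: str                # details of the data source
    transformation_rule: str   # rule applied while loading


@dataclass
class MetadataRepository:
    fields: dict = field(default_factory=dict)    # field name -> FieldMetadata
    job_runs: dict = field(default_factory=dict)  # job name -> time of last run

    def describe_field(self, name):
        """What does this field mean in business terms?"""
        meta = self.fields.get(name)
        return meta.business_meaning if meta else None

    def record_job_run(self, job, finished_at):
        """Store operational metadata about a completed load or refresh job."""
        self.job_runs[job] = finished_at

    def last_run(self, job):
        """When did this job last run?"""
        return self.job_runs.get(job)


repo = MetadataRepository()
repo.fields["cust_dob"] = FieldMetadata(
    business_meaning="Customer date of birth",
    source="crm.customers.dob",            # assumed source location
    transformation_rule="parsed to an ISO 8601 date",
)
repo.record_job_run("update_customer_mart", "2009-03-01T02:00:00")
print(repo.describe_field("cust_dob"), repo.last_run("update_customer_mart"))
```

Keeping both kinds of entries in one place lets a user look up what a field means and an administrator check when a job last ran.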

Types of Metadata

Build-Time Metadata
Whenever we design and build a warehouse, the metadata that we generate can be termed
build-time metadata. This metadata links business and warehouse terminology and describes the
data’s technical structure. It is the most detailed and exact type of metadata and is used
extensively by warehouse designers, developers, and administrators. It is the primary source of
most of the metadata used in the warehouse.

Usage Metadata

When the warehouse is in production, usage metadata, which is derived from build-time
metadata, is an important tool for users and data administrators. This metadata is used differently
from build-time metadata, and its structure must accommodate this fact.

Data Warehousing Trends

The trends and direction of the emerging data warehousing reference architectures shape the
enterprise installations of data warehouses.

Trends in Data Warehousing

Industry experience with data warehousing has provided important lessons and also revealed the
emerging trends that shape the industry direction in business intelligence solutions. As a
result, the emerging reference architectures used in building enterprise data warehouse
solutions are changing to meet business demands.

1. It is these evolving reference architectures that are putting new demands on
the databases that are used in warehousing.

2. With the emergence and evolution of the intranet, as well as more businesses
exploiting semi-structured data, the more traditional business models are
evolving with respect to such things as data accessibility, delivery, and
concurrency.

3. Technology such as XML and web services becomes more critical as databases integrate with
web portals and BI tooling.

4. Moreover, additional demands for broader decision making within enterprises are causing
heavy consolidation and non-traditional mixed workloads (heavily mixing OLTP and DSS, i.e.
decision support system, workloads) beyond what has been conventional in the past.

5. Moreover, in many cases consolidation is not an option or is not desired. In such cases,
the business question still needs to be run. As a result, federation augmentation is also very
real in enterprise systems.

Note - Query management in a federated environment is still a challenging task.

A combination of consolidation and federation augmentation is being seen.

6. In addition to heavy consolidation and federation augmentation, both real-time (right-time)
and active data warehousing systems are being built. These systems present interesting challenges
to traditional maintenance and extract/transformation/load (ETL) operational procedures,
specifically in large multi-terabyte systems that run 24x7x365. Queries in such systems that
execute over aggregated data (including materialized views) need to be very close in time to a
consolidated operational data store (ODS) in the same enterprise data warehouse. The maintenance
challenges are pushing the technology.

7. Finally, closed-loop processing in an enterprise-wide solution allows warehouses to play an
even more crucial role. Not only are operational systems creating events, so are data
warehouses; they play a crucial active role in an enterprise.

One example of events produced in a warehouse is measures, which may be key performance
indicators (KPIs) used in business performance monitoring through portals.
