Data Warehouse: Dr. Vaibhav Sharma
Data warehouse
What is a Data Warehouse?
[Barry Devlin]
A single, complete and consistent store of data obtained from a variety of different sources made
available to end users in a way they can understand and use in a business context.
Inmon’s definition of Data Warehouse: In 1993, Bill Inmon, the "father of data warehousing",
defined a data warehouse as follows: A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection of data in support of management’s
decision-making process.
Data Warehouse Usage:-
1. Data warehouses and data marts are used in a wide range of applications.
2. Business executives use the data in data warehouses and data marts to perform data analysis
and make strategic decisions.
3. In many areas, data warehouses are used as an integral part of enterprise management.
4. The data warehouse is mainly used for generating reports and answering predefined queries.
5. It is used to analyze summarized and detailed data, where the results are presented in the form
of reports and charts.
6. Later, the data warehouse is used for strategic purposes, performing multidimensional analysis
and sophisticated operations.
7. Finally, the data warehouse may be employed for knowledge discovery and strategic decision
making using data mining tools.
8. In this context, the tools for data warehousing can be categorized into access and retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
Reasons for data Warehouse:
There are a few reasons why a data warehouse should exist:
a) You want to integrate data across functions or systems to provide a complete picture of the
data subject e.g. customer orders, customer complaints, salespersons.
b) You do not want to interfere with the fast performing transaction systems by running
large computer resource queries and reports whilst routine users and possibly customers are
executing the essential business transactions.
c) You want to reorganize the data to support fast reporting and querying.
d) You want to clean up the data to improve its consistency and integrity. Many
systems do not have strict input validation and so contain duplicates, e.g. the same customer
entered more than once. There are also often different definitions for the same subject or
entity within the business, e.g. customer, client, prospect.
1. Banking Industry
In the banking industry, concentration is given to risk management and policy reversal, as well
as to analyzing consumer data, market trends, government regulations and reports, and, more
importantly, financial decision making.
Most banks also use warehouses to manage the resources available on deck in an effective
manner. Certain banking sectors utilize them for market research, performance analysis of each
product, interchange and exchange rates, and to develop marketing programs.
Banks also analyze cardholders’ transactions, spending patterns and merchant classification, all
of which give the bank an opportunity to introduce special offers and lucrative deals based
on cardholder activity. Apart from all these, there is also scope for co-branding.
2. Finance Industry
Applications in the finance industry are similar to those seen in banking. They mainly revolve
around the evaluation and trends of customer expenses, which aids in maximizing the profits
earned by their clients.
4. Government and Education
The federal government utilizes warehouses for research in compliance, whereas state
governments use them for services related to human resources, like recruitment, and to
accounting, like payroll management. Governments use data warehouses to maintain and analyze
tax records and health policy records and their respective providers; their entire criminal law
database is also connected to the state’s data warehouse. Criminal activity is predicted from the
patterns and trends that result from analyzing historical data associated with past criminals.
Universities use warehouses to extract information used for research grant proposals, to
understand their student demographics, and for human resource management. The entire
financial department of most universities depends on data warehouses, including the Financial
Aid department.
5. Healthcare
One of the most important sectors that utilize data warehouses is the healthcare sector. All of
its financial, clinical, and employee records are fed to warehouses, which helps it to strategize
and predict outcomes, track and analyze service feedback, generate patient reports, and share
data with tie-in insurance companies, medical aid services, etc.
6. Hospitality Industry
A major proportion of this industry is dominated by hotel and restaurant services, car rental
services, and holiday home services. They utilize warehouse services to design and evaluate their
advertising and promotion campaigns where they target customers based on their feedback and
travel patterns.
7. Insurance
As the saying goes in the insurance services sector, “Insurance can never be bought, it can
only be sold”. Warehouses are primarily used to analyze data patterns and customer trends,
apart from maintaining records of already existing participants. The design of tailor-made
customer offers and promotions is also possible through warehouses.
8. Manufacturing and Distribution Industry
This industry is one of the most important sources of income for any state. A manufacturing
organization has to take several make-or-buy decisions which can influence the future of the
sector, which is why they utilize high-end OLAP tools as a part of data warehouses to predict
market changes, analyze current business trends, detect warning conditions, view marketing
developments, and ultimately take better decisions.
They also use warehouses to keep product shipment records and records of product portfolios, to
identify profitable product lines, and to analyze previous data and customer feedback in order to
evaluate the weaker product lines and eliminate them. On the distribution side, the supply chain
management of products operates through data warehouses.
9. The Retailers
Retailers serve as middlemen between producers and consumers. It is important for them to
maintain records of both the parties to ensure their existence in the market. They use warehouses
to track items, their advertising promotions, and consumers’ buying trends. They also analyze
sales to determine fast selling and slow selling product lines and determine their shelf space
through a process of elimination.
10. Services Sector
Data warehouses are used in the service sector for the maintenance of financial
records, revenue patterns, customer profiling, resource management, and human resources.
11. Telephone Industry
The telephone industry operates over both offline and online data, which burdens it with a lot of
historical data that has to be consolidated and integrated. Apart from those operations, analysis
of fixed assets, analysis of customers’ calling patterns so that sales representatives can push
advertising campaigns, and tracking of customer queries all require the facilities of a data
warehouse.
12. Transportation Industry
In the transportation industry, data warehouses record customer data, enabling traders to
experiment with target marketing, where the marketing campaigns are designed by keeping
customer requirements in mind.
Internally, the industry uses them to analyze customer feedback and performance, to manage
crews on board, and to analyze customer financial reports for pricing strategies.
Data Warehouse
A data warehouse is an information system that contains historical and cumulative data from
single or multiple sources. It simplifies the reporting and analysis process of the organization. It
is also a single version of the truth for any company for decision making and forecasting.
A data warehouse has four key characteristics:
Subject-Oriented
Integrated
Time-variant
Non-volatile
(i). Subject-Oriented
A data warehouse is subject oriented as it offers information regarding a theme
instead of companies' ongoing operations. These subjects can be sales, marketing,
distributions, etc.
A data warehouse never focuses on the ongoing operations. Instead, it puts emphasis
on modeling and analysis of data for decision making. It also provides a simple and
concise view around the specific subject by excluding data which is not helpful to
support the decision process.
(ii). Integrated
In a data warehouse, integration means the establishment of a common unit of measure
for all similar data from dissimilar databases. The data also needs to be stored in
the data warehouse in a common and universally acceptable manner.
For example, consider three different applications labeled A, B and C. The information
stored in these applications is Gender, Date, and Balance. However, each application stores
its data in a different way.
After the transformation and cleaning process, all this data is stored in a common format in
the data warehouse.
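The transformation and cleaning step described above can be sketched in Python. The source encodings used here (the gender codes and the three date layouts), as well as every function and field name, are illustrative assumptions, since the notes do not specify how applications A, B and C actually store their data. This is a minimal sketch, not a prescribed implementation.

```python
from datetime import datetime

# Hypothetical source encodings for the three applications A, B and C.
GENDER_MAP = {"m": "M", "f": "F", "1": "M", "0": "F", "male": "M", "female": "F"}
DATE_FORMATS = {"A": "%d/%m/%Y", "B": "%Y%m%d", "C": "%m-%d-%Y"}


def to_warehouse_row(app, record):
    """Convert one source record to the common warehouse format."""
    gender = GENDER_MAP[str(record["gender"]).strip().lower()]
    date = datetime.strptime(record["date"], DATE_FORMATS[app]).date().isoformat()
    balance = round(float(record["balance"]), 2)
    return {"gender": gender, "date": date, "balance": balance}


rows = [
    to_warehouse_row("A", {"gender": "m", "date": "31/01/2020", "balance": "1500.5"}),
    to_warehouse_row("B", {"gender": 1, "date": "20200131", "balance": 1500.50}),
    to_warehouse_row("C", {"gender": "Male", "date": "01-31-2020", "balance": "1500.50"}),
]
# All three rows now share one representation of gender, date and balance.
```

Keeping the per-application rules in lookup tables means a new source system can be added by extending the tables rather than the conversion logic.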
(iii). Time-Variant
The time horizon for a data warehouse is quite extensive compared with operational systems. The
data collected in a data warehouse is identified with a particular period and offers information
from a historical point of view. It contains an element of time, explicitly or implicitly.
One place where data warehouse data displays time variance is in the structure of the
record key. Every primary key contained in the DW should have, either implicitly or explicitly,
an element of time, such as the day, week, or month.
Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated
or changed.
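A record key with an explicit time element might look like the following Python sketch. The account identifier, the snapshot dates, and the `record_snapshot` helper are hypothetical names chosen for illustration only.

```python
from datetime import date

# Hypothetical fact store: the key pairs a business identifier with a snapshot date.
balances = {}


def record_snapshot(account_id, snapshot_date, balance):
    """Add a dated snapshot; an existing (account, date) key is never overwritten."""
    key = (account_id, snapshot_date)
    if key in balances:
        raise ValueError(f"snapshot {key} already exists and cannot be changed")
    balances[key] = balance


record_snapshot("ACC-1", date(2020, 1, 31), 1500.0)
record_snapshot("ACC-1", date(2020, 2, 29), 1750.0)

# History for one account, ordered by the time element of the key.
history = sorted((d, b) for (acc, d), b in balances.items() if acc == "ACC-1")
```

Because the date is part of the key, a new balance becomes a new record rather than a change to an old one, which is what keeps the history intact.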
(iv). Non-volatile
A data warehouse is also non-volatile, which means that the previous data is not erased when
new data is entered into it.
Data is read-only and periodically refreshed. This also helps to analyze historical data and
understand what happened and when. It does not require transaction processing, recovery and
concurrency control mechanisms.
Activities like delete, update, and insert, which are performed in an operational application
environment, are omitted in the data warehouse environment. Only two types of data operations
are performed in data warehousing:
1. Data loading
2. Data access
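As a rough illustration of a store that offers only these two operations, consider the Python sketch below. The `ReadOnlyWarehouse` class and the sample rows are hypothetical; a real warehouse would enforce this at the database level rather than in application code.

```python
class ReadOnlyWarehouse:
    """Toy store exposing only the two operations named above: loading and access."""

    def __init__(self):
        self._rows = []  # append-only; earlier rows are never modified or removed

    def load(self, batch):
        """Data loading: append a batch of rows from a periodic refresh."""
        self._rows.extend(dict(row) for row in batch)

    def query(self, predicate):
        """Data access: return copies of the rows matching a predicate."""
        return [dict(row) for row in self._rows if predicate(row)]


dw = ReadOnlyWarehouse()
dw.load([{"region": "North", "year": 2019, "sales": 120}])
dw.load([{"region": "North", "year": 2020, "sales": 150}])
north = dw.query(lambda r: r["region"] == "North")
```

Returning copies from `query` means a caller cannot alter stored rows by accident, so a later load never erases or changes what was loaded before.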
Here are some major differences between an operational application and a data warehouse:
Operational Application: Complex programs must be coded to make sure that data update
processes maintain high integrity of the final product.
Data Warehouse: This kind of issue does not arise because data updates are not performed.
In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse
(EDW), is a system used for reporting and data analysis. DWs are central repositories of
integrated data from one or more disparate sources. They store current and historical data and are
used for creating analytical reports for knowledge workers throughout the enterprise. Examples
of reports could range from annual and quarterly comparisons and trends to detailed daily sales
analyses.
The data stored in the warehouse is uploaded from the operational systems (such as marketing,
sales, etc.). The data may pass through an operational data store
for additional operations before it is used in the DW for reporting.
A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports
analytical reporting, structured and/or ad hoc queries and decision making. This tutorial adopts a
step-by-step approach to explain all the necessary concepts of data warehousing.
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data. This
data helps analysts to take informed decisions in an organization.
Data mining functions such as association, clustering, classification, and prediction can be
integrated with OLAP operations to enhance the interactive mining of knowledge at multiple
levels of abstraction. That is why the data warehouse has now become an important platform for
data analysis and online analytical processing.
Note: A data warehouse does not require transaction processing, recovery, and concurrency
controls, because it is stored physically separate from the operational database.
Information processing, analytical processing, and data mining are the three types of data
warehouse applications that are discussed below:
a) Information Processing - A data warehouse allows the data stored in it to be processed. The
data can be processed by means of querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs.
b) Analytical Processing - A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP operations,
including slice-and-dice, drill down, drill up, and pivoting.
c) Data Mining - Data mining supports knowledge discovery by finding hidden patterns
and associations, constructing analytical models, and performing classification and
prediction. These mining results can be presented using visualization tools.
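The basic OLAP operations named above can be illustrated on a tiny in-memory fact table. The sales figures, dimension names, and helper functions below are made up for illustration; real OLAP tools run these operations over a cube or a relational star schema rather than a Python list.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, product, amount).
facts = [
    (2020, "Q1", "North", "Laptop", 100),
    (2020, "Q1", "South", "Laptop", 80),
    (2020, "Q2", "North", "Phone", 60),
    (2021, "Q1", "North", "Laptop", 120),
    (2021, "Q1", "South", "Phone", 90),
]
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}


def aggregate(rows, dims):
    """Sum the amount over the chosen dimensions; fewer dims means a coarser view."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[DIMS[d]] for d in dims)] += row[4]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]


def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in rows if all(r[DIMS[d]] in vals for d, vals in allowed.items())]


by_year = aggregate(facts, ["year"])                      # drill up (roll up)
by_year_quarter = aggregate(facts, ["year", "quarter"])   # drill down
north_only = slice_(facts, "region", "North")             # slice
sub_cube = dice(facts, year={2020}, product={"Laptop"})   # dice
pivot = aggregate(facts, ["region", "year"])              # pivot: regions by years
```

Drill up and drill down differ only in how many dimensions are kept in the grouping, which is why a single `aggregate` helper covers both.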
Database size: a data warehouse ranges from 100 GB to 100 TB, whereas an operational database
ranges from 100 MB to 100 GB.
1. Project initiation
2. Requirements analysis
3. Design (architecture, databases and applications)
4. Construction (selecting and installing tools, developing data feeds and
building reports)
5. Deployment (release & training)
6. Maintenance
1. Project initiation
No data warehousing project should commence without:
A small team is usually set up to prepare and present a suitable project initiation
document. This is normally a joint effort between business and IT managers. If the
organization has limited data warehousing experience, it is useful to obtain
external advice at this stage. If the project goes ahead, the project plan and
business case should be reviewed at each stage.
2. Requirements analysis
Establishing a broad view of the business’ requirements should always be the first
step. The understanding gained will guide everything that follows, and the details
can be filled in for each phase in turn.
Typical activities include:
a) Interviewing a number of potential users to find out what they do, the
information they need and how they analyse it in order to make decisions.
It is often helpful to analyse some of the reports they currently use.
b) Interviewing information systems specialists to find out what data are
available in potential source systems, and how they are organised.
c) Analysing the requirements to establish those that are feasible given
available data.
d) Running facilitated workshops that bring representative users and IT staff
together to build consensus about what is needed, what is feasible and
where to start.
3. Design
The goal of the design process is to define the warehouse components that will
need to be built. The architecture, data and application designs are all inter-
related, and are normally produced in parallel.
The logical design determines what data are stored in the main data warehouse
and any associated functional data marts. There are a number of data modelling
techniques that can be used to help.
Once the logical design is established, the next step is to define the physical
characteristics of individual data stores (including aggregates) and any associated
indexes required to optimize performance (see database optimization).
The data design is critical to further progress, in that it defines the target for the
data feeds and provides the source data for all reporting and analysis
applications.
There may be one or more applications associated with each data mart or phase
of development.
4. Construction
Warehouse components are usually developed iteratively and in parallel. That
said, the most efficient sequence to begin construction is probably as follows:
a) ETL tool
b) Database(s) for the warehouse (usually relational) and marts
(often multi-dimensional)
c) Reporting and analysis tools
However thorough the design process, problems with the real data are bound to
surface at this stage. Substantial time should be allowed to resolve any issues that
arise, establish appropriate data cleansing procedures (preferably within the
source systems environment) and to validate all data before they are released for
live use.
5. Deployment
It is too often assumed that the first version of a data warehouse can be rolled
out in a matter of weeks, simply by showing all the users how to use the new
reporting tools.
In practice, training needs to cover not just the basic use of the tools, but also the
data that have been made available, and, more significantly perhaps, the new
business processes or different ways of working that are intended. This training
usually works best if delivered on a one-to-one basis.
If the first users find errors and inconsistencies in the data, don’t feel comfortable
with the tool or can’t be bothered to learn how to use it properly, or won’t accept
new procedures and responsibilities, all the time spent building the warehouse
may ultimately be wasted. The following guidelines will help to reduce these risks:
a) Do not start deployment until the data are ready (available and validated)
and the tools and update procedures have been tested;
b) Use a small, representative group to try out the finished system before
rolling out, including users with a range of abilities and attitudes;
c) Do not grant system access to users until they have been trained.
6. Maintenance
A data warehouse is not like an OLTP system: development is never finished, but
follows an iterative cycle (analyse – build – deploy). Also, once live, a warehousing
environment requires substantial effort to keep running. Thus the development
team should not expect to hand over and move on to other projects, but should
plan to spend half of their time on support and maintenance.
1. Bottom-Up Design:
In the bottom-up design approach, the data marts are created first to
provide reporting capability. A data mart addresses a single business
area such as sales, finance, etc. These data marts are then integrated
to build a complete data warehouse. The integration of data marts is
implemented using the data warehouse bus architecture. In the bus
architecture, a dimension is shared between facts in two or more data
marts. These dimensions are called conformed dimensions. The
conformed dimensions are integrated across the data marts, and the data
warehouse is then built.
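The idea of a conformed dimension can be sketched in Python as follows. The product dimension, the two marts (sales and inventory), and all keys and measures are hypothetical examples, not part of any specific bus architecture tool.

```python
# Hypothetical conformed dimension shared by two data marts.
product_dim = {
    1: {"name": "Laptop", "category": "Computers"},
    2: {"name": "Phone", "category": "Mobile"},
}

# Each mart keeps its own fact table but references the same product keys.
sales_facts = [{"product_key": 1, "revenue": 1000}, {"product_key": 2, "revenue": 400}]
inventory_facts = [{"product_key": 1, "on_hand": 15}, {"product_key": 2, "on_hand": 40}]


def by_category(facts, measure):
    """Sum one measure per product category via the shared dimension."""
    totals = {}
    for fact in facts:
        category = product_dim[fact["product_key"]]["category"]
        totals[category] = totals.get(category, 0) + fact[measure]
    return totals


# Because both marts conform to product_dim, their results line up by category.
revenue = by_category(sales_facts, "revenue")
stock = by_category(inventory_facts, "on_hand")
combined = {c: {"revenue": revenue[c], "on_hand": stock[c]} for c in revenue}
```

If each mart defined its own product keys or category names, the final combining step would need a mapping between them, which is the integration cost that conformed dimensions are meant to avoid.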
2. Top-Down Design:
In the top-down design approach, the data warehouse is built first. The data
marts are then created from the data warehouse.