What Is A Data Warehouse


What is a Data Warehouse?

Tutorial, Characteristics, Concepts - javatpoint

Data Warehouse Tutorial

A Data Warehouse is a relational database management system (RDBMS)
construct built to meet needs that transaction processing systems do not
address. It can be loosely described as any centralized data repository that
can be queried for business benefit. It is a database that stores information
oriented toward satisfying decision-making requests. Data warehousing is a
group of decision support technologies aimed at enabling the knowledge
worker (executive, manager, or analyst) to make better and faster decisions.
Data warehousing therefore provides architectures and tools that let business
executives systematically organize, understand, and use their information to
make strategic decisions.

A Data Warehouse environment contains an extraction, transformation, and
loading (ETL) solution, an online analytical processing (OLAP) engine,
client analysis tools, and other applications that handle the process of
gathering information and delivering it to business users.

What is a Data Warehouse?


A Data Warehouse (DW) is a relational database that is designed for query
and analysis rather than transaction processing. It includes historical data
derived from transaction data, drawn from one or more sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and
focuses on supporting decision-makers in data modeling and analysis.

A Data Warehouse is a collection of data that serves the entire organization,
not only a particular group of users. It is not used for daily operations and
transaction processing but for making decisions.

A Data Warehouse can be viewed as a data system with the following
attributes:

o It is a database designed for investigative tasks, using data from
  various applications.
o It supports a relatively small number of clients with relatively long
  interactions.
o It includes current and historical data to provide a historical
  perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of
information in support of management's decisions."

Characteristics of Data Warehouse


Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for
decision-makers. Therefore, data warehouses typically provide a concise and
straightforward view around a particular subject, such as customer, product,
or sales, instead of the organization's ongoing operations as a whole. This is
done by excluding data that is not useful concerning the subject and including
all data needed by the users to understand the subject.

Integrated
A data warehouse integrates various heterogeneous data sources such as
RDBMSs, flat files, and online transaction records. Data cleaning and
integration must be performed during data warehousing to ensure consistency
in naming conventions, attribute types, etc., among the different data
sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can
retrieve data from 3 months, 6 months, 12 months, or even earlier from a data
warehouse. This contrasts with a transaction system, where often only the
most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, which is
transformed from the source operational RDBMS. Operational updates of data
do not occur in the data warehouse, i.e., the update, insert, and delete
operations of transaction processing are not performed. It usually requires
only two procedures in data accessing: initial loading of data and access to
data. Therefore, the DW does not require transaction processing, recovery,
and concurrency capabilities, which allows for a substantial speedup of data
retrieval. Non-volatile means that once data has entered the warehouse, it
should not change.
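
The time-variant and non-volatile properties can be illustrated with a
minimal Python sketch. The AppendOnlyStore class and its method names are
hypothetical, invented for this illustration; the only point is that each
load adds dated rows and nothing already stored is updated or deleted.

```python
from datetime import date


class AppendOnlyStore:
    """Toy in-memory stand-in for a non-volatile, time-variant store (hypothetical)."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot_date, records):
        # Loading only appends; existing rows are never updated or deleted.
        for record in records:
            self._rows.append({"snapshot_date": snapshot_date, **record})

    def history(self, account):
        # Read access: every stored snapshot for one account, oldest first.
        rows = [r for r in self._rows if r["account"] == account]
        return sorted(rows, key=lambda r: r["snapshot_date"])


store = AppendOnlyStore()
store.load(date(2024, 1, 31), [{"account": "A1", "balance": 100}])
store.load(date(2024, 2, 29), [{"account": "A1", "balance": 150}])

# Both month-end values remain available, unlike a transaction system
# that would typically keep only the latest balance.
balances = [row["balance"] for row in store.history("A1")]
print(balances)
```

A real warehouse would enforce this through its load process rather than a
Python class; the sketch only shows the access pattern of load, then read.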
History of Data Warehouse
The idea of data warehousing dates to the late 1980s, when IBM researchers
Barry Devlin and Paul Murphy developed the "Business Data Warehouse."

In essence, the data warehousing idea was intended to support an
architectural model for the flow of information from operational systems to
decision support environments. The concept attempted to address the various
problems associated with this flow, mainly the high costs associated with it.

In the absence of a data warehousing architecture, a vast amount of space
was required to support multiple decision support environments. In large
corporations, it was common for various decision support environments to
operate independently.

Goals of Data Warehousing


o To support reporting as well as analysis
o To maintain the organization's historical information
o To be the foundation for decision making
Need for Data Warehouse
Data Warehouse is needed for the following reasons:

1. Business users: Business users require a data warehouse to view
   summarized data from the past. Since these people are often
   non-technical, the data may be presented to them in a simple form.
2. Store historical data: A Data Warehouse is required to store
   time-variant data from the past. This data can then be used for various
   purposes.
3. Make strategic decisions: Some strategies may depend on the data in the
   data warehouse, so the data warehouse contributes to making strategic
   decisions.
4. Data consistency and quality: By bringing data from different sources to
   a common place, the organization can effectively bring uniformity and
   consistency to the data.
5. Quick response time: A data warehouse has to be ready for somewhat
   unexpected loads and types of queries, which demands a significant degree
   of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to perform well with enormous amounts of
   data.
3. The structure of data warehouses is easier for end users to navigate,
   understand, and query.
4. Queries that would be complex in many normalized databases can be
   easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of
   information from lots of users.
6. Data warehousing provides the capabilities to analyze a large amount of
   historical data.

Components or Building Blocks of Data Warehouse

Architecture is the proper arrangement of the elements. We build a data
warehouse with software and hardware components. To suit the requirements of
our organization, we arrange these building blocks in a suitable way, and we
may want to strengthen a particular part with extra tools and services. All of
this depends on our circumstances.

The figure shows the essential elements of a typical warehouse. The Source
Data component is shown on the left. The Data Staging element serves as the
next building block. In the middle is the Data Storage component, which
handles the data warehouse's data. This element not only stores and manages
the data; it also keeps track of the data using the metadata repository. The
Information Delivery component, shown on the right, consists of all the
different ways of making the information from the data warehouse available to
the users.

Source Data Component


Source data coming into the data warehouse may be grouped into four broad
categories:

Production Data: This type of data comes from the different operational
systems of the enterprise. Based on the data requirements in the data
warehouse, we choose segments of the data from the various operational
systems.

Internal Data: In each organization, users keep their "private"
spreadsheets, reports, customer profiles, and sometimes even departmental
databases. This is the internal data, part of which could be useful in a data
warehouse.

Archived Data: Operational systems are mainly intended to run the current
business. In every operational system, we periodically take the old data and
store it in archived files.

External Data: Most executives depend on information from external sources
for a large percentage of the information they use. They use statistics
relating to their industry produced by external organizations.

Data Staging Component


After we have extracted data from various operational systems and external
sources, we have to prepare the data for storing in the data warehouse. The
extracted data coming from several different sources needs to be changed,
converted, and made ready in a format that is suitable for querying and
analysis.

We will now discuss the three primary functions that take place in the
staging area.

1) Data Extraction: This function has to deal with numerous data sources.
We have to employ the appropriate technique for each data source.

2) Data Transformation: As we know, data for a data warehouse comes from
many different sources. If data extraction for a data warehouse poses big
challenges, data transformation presents even greater challenges. We perform
several individual tasks as part of data transformation.

First, we clean the data extracted from each source. Cleaning may involve
correcting misspellings, providing default values for missing data elements,
or eliminating duplicates when we bring in the same data from various source
systems.

Standardization of data elements forms a large part of data transformation.
Data transformation involves many forms of combining pieces of data from
different sources. We combine data from a single source record or related
data elements from many source records.

On the other hand, data transformation also involves purging source data
that is not useful and separating out source records into new combinations.
Sorting and merging of data take place on a large scale in the data staging
area. When the data transformation function ends, we have a collection of
integrated data that is cleaned, standardized, and summarized.
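
The cleaning and standardization tasks described above can be sketched in a
few lines of Python. This is only an illustration under stated assumptions:
the function name clean_records, the record fields "name" and "country", and
the default value "UNKNOWN" are all hypothetical and not taken from any
particular ETL tool.

```python
def clean_records(records, default_country="UNKNOWN"):
    """Standardize names and countries, fill missing values, drop duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Standardization: collapse extra whitespace and use one naming style.
        name = " ".join(rec.get("name", "").split()).title()
        # Default value for a missing data element.
        country = (rec.get("country") or default_country).strip().upper()
        key = (name, country)
        # Elimination of duplicates arriving from different source systems.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "country": country})
    return cleaned


raw = [
    {"name": "  alice  smith", "country": "us"},   # from source system A
    {"name": "Alice Smith", "country": "US "},     # same person, source B
    {"name": "bob jones", "country": None},        # missing country
]
result = clean_records(raw)
print(result)
```

Here the two spellings of the same customer collapse into one record and the
missing country is replaced by the default. Real staging areas apply many
more rules, but they follow the same clean, standardize, deduplicate order.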

3) Data Loading: Two distinct categories of tasks form the data loading
function. When we complete the design and construction of the data warehouse
and go live for the first time, we do the initial loading of the information
into the data warehouse storage. The initial load moves high volumes of data
and uses up a substantial amount of time. The second category is the ongoing
loading of changes from the source systems once the warehouse is in
operation.

Data Storage Components


Data storage for the data warehouse is a separate repository. The data
repositories for the operational systems generally include only the current
data. Also, these data repositories hold data structured in highly normalized
form for fast and efficient processing.

Information Delivery Component


The information delivery element is used to enable the process of
subscribing to data warehouse files and having them transferred to one or
more destinations according to some user-specified scheduling algorithm.

Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data
catalog in a database management system. In the data dictionary, we keep
the data about the logical data structures, the data about the records and
addresses, the information about the indexes, and so on.

Data Marts
A data mart includes a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to particular selected
subjects. Data in a data warehouse should be fairly current, but not
necessarily up to the minute, although developments in the data warehouse
industry have made standard and incremental data dumps more achievable. Data
marts are smaller than data warehouses and usually contain data for one part
of the organization. The current trend in data warehousing is to develop a
data warehouse with several smaller related data marts for particular kinds
of queries and reports.

Management and Control Component


The management and control elements coordinate the services and functions
within the data warehouse. These components control the data transformation
and the data transfer into the data warehouse storage. They also moderate
the data delivery to the clients. They work with the database management
systems and enable data to be correctly saved in the repositories. They
monitor the movement of information into the staging area and from there
into the data warehouse storage itself.

Why do we need a separate Data Warehouse?

Data Warehouse queries are complex because they involve the computation of
large groups of data at summarized levels.

They may require the use of distinctive data organization, access, and
implementation methods based on multidimensional views.

Performing OLAP queries in an operational database degrades the performance
of operational tasks.

A Data Warehouse is used for analysis and decision making, which require an
extensive database, including historical data, that an operational database
does not typically maintain.

The separation of an operational database from data warehouses is based on
the different structures and uses of data in these systems.

Because the two systems provide different functionalities and require
different kinds of data, it is necessary to maintain separate databases.

Difference between Database and Data Warehouse


No. | Database | Data Warehouse
1 | It is used for Online Transactional Processing (OLTP) but can be used for other purposes such as data warehousing. It records data from clients for history. | It is used for Online Analytical Processing (OLAP). It reads historical information about customers for business decisions.
2 | The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space. | The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.
3 | Data is dynamic. | Data is largely static.
4 | Entity-relational modeling techniques are used for RDBMS database design. | Data modeling techniques are used for the Data Warehouse design.
5 | Optimized for write operations. | Optimized for read operations.
6 | Performance is low for analysis queries. | High performance for analytical queries.
7 | The database is the place where data is stored and managed to provide fast and efficient access. | The Data Warehouse is the place where application data is handled for analysis and reporting objectives.

Difference between Operational Database and Data Warehouse

The Operational Database is the source of information for the data
warehouse. It includes detailed information used to run the day-to-day
operations of the business. The data changes frequently as updates are made
and reflects the current value of the last transactions.

Operational Database Management Systems, also called OLTP (Online
Transaction Processing) databases, are used to manage dynamic data in real
time.

Data Warehouse Systems serve users or knowledge workers for the purpose of
data analysis and decision-making. Such systems can organize and present
information in specific formats to accommodate the diverse needs of various
users. These systems are called Online Analytical Processing (OLAP) systems.

A Data Warehouse and an OLTP database are both relational databases.
However, the goals of the two databases are different.

Operational Database | Data Warehouse
Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
Data within operational systems is updated regularly according to need. | Data is non-volatile; new data may be added regularly, but once added it is rarely changed.
It is designed for real-time business dealings and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
It is optimized for validation of incoming information during transactions and uses validation data tables. | It is loaded with consistent, valid information and requires no real-time validation.
It supports thousands of concurrent clients. | It supports few concurrent clients relative to OLTP.
Operational systems are largely process-oriented. | Data warehousing systems are largely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.
Data in. | Data out.
A small number of records is accessed. | A large number of records is accessed.
Relational databases are created for Online Transactional Processing (OLTP). | Data Warehouses are designed for Online Analytical Processing (OLAP).

Difference between OLTP and OLAP


OLTP System
An OLTP system handles operational data. Operational data are the data
involved in the operation of a particular system, for example ATM
transactions and bank transactions.

OLAP System
An OLAP system handles historical or archival data. Historical data are data
that are archived over a long period. For example, if we collect the last 10
years of information about flight reservations, the data can reveal
meaningful information such as trends in reservations. This may provide
useful information such as the peak time of travel and what kinds of people
travel in the various classes (Economy/Business).

The major difference between an OLTP and an OLAP system is the amount of
data analyzed in a single transaction. An OLTP system manages many
concurrent users and queries, each touching only an individual record or a
limited group of records at a time, whereas an OLAP system must be able to
operate on millions of records to answer a single query.
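
The contrast between the two access styles can be sketched with Python's
built-in sqlite3 module. The reservations table, its columns, and the sample
rows are hypothetical, chosen only to echo the flight reservation example;
a real OLTP or OLAP system would of course hold far more data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations ("
    "id INTEGER PRIMARY KEY, travel_month TEXT, cabin TEXT, fare REAL)"
)
conn.executemany(
    "INSERT INTO reservations (travel_month, cabin, fare) VALUES (?, ?, ?)",
    [
        ("2023-07", "Economy", 200.0),
        ("2023-07", "Business", 650.0),
        ("2023-07", "Economy", 220.0),
        ("2023-12", "Economy", 300.0),
    ],
)

# OLTP-style access: touch one record, located by its key.
one_booking = conn.execute(
    "SELECT cabin, fare FROM reservations WHERE id = ?", (2,)
).fetchone()

# OLAP-style access: scan all rows and summarize them to find the peak month.
peak_month = conn.execute(
    "SELECT travel_month, COUNT(*) AS bookings FROM reservations "
    "GROUP BY travel_month ORDER BY bookings DESC LIMIT 1"
).fetchone()

print(one_booking, peak_month)
```

The first query reads a single row regardless of table size, while the second
must read every row, which is why the two workloads are usually kept apart.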

Feature | OLTP | OLAP
Characteristic | It is a system used to manage operational data. | It is a system used to manage informational data.
Users | Clerks, clients, and information technology professionals. | Knowledge workers, including managers, executives, and analysts.
System orientation | An OLTP system is customer-oriented; transaction and query processing are done by clerks, clients, and information technology professionals. | An OLAP system is market-oriented; data analysis is done by knowledge workers, including managers, executives, and analysts.
Data contents | An OLTP system manages current data that is too detailed to be easily used for decision making. | An OLAP system manages a large amount of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity. These features make the data easier to use in informed decision making.
Database size | 100 MB to GB | 100 GB to TB
Database design | An OLTP system usually uses an entity-relationship (ER) data model and an application-oriented database design. | An OLAP system typically uses either a star or snowflake model and a subject-oriented database design.
View | An OLTP system focuses primarily on the current data within an enterprise or department, without referring to historical information or data in different organizations. | An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with data that originates from various organizations, integrating information from many data stores.
Volume of data | Not very large. | Because of their large volume, OLAP data are stored on multiple storage media.
Access patterns | The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery techniques. | Accesses to OLAP systems are mostly read-only operations because these data warehouses store historical data.
Access mode | Read/write | Mostly read
Inserts and updates | Short and fast inserts and updates initiated by end users. | Periodic long-running batch jobs refresh the data.
Number of records accessed | Tens | Millions
Normalization | Fully normalized | Partially normalized
Processing speed | Very fast | It depends on the amount of data involved; batch data refreshes and complex queries may take many hours, and query speed can be improved by creating indexes.
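
The star model mentioned in the database design row can be sketched with
Python's built-in sqlite3 module: one central fact table whose rows point to
small dimension tables. The table and column names (fact_sales, dim_product,
dim_date) and the sample rows are hypothetical, chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the subjects of analysis.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    -- The fact table holds the measures plus a key into each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_date VALUES (10, 2022), (11, 2023);
    INSERT INTO fact_sales VALUES
        (1, 10, 50.0), (1, 11, 70.0), (2, 11, 30.0), (2, 11, 20.0);
    """
)

# A typical analytical query: join facts to dimensions, then aggregate.
sales_by_category_year = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
print(sales_by_category_year)
```

A snowflake model would further split the dimension tables (for example,
category into its own table); the fact table and the query shape stay similar.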

Data Warehouse Architecture


A data warehouse architecture is a method of defining the overall
architecture of data communication, processing, and presentation that exists
for end-client computing within the enterprise. Each data warehouse is
different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product
purchasing, and inventory control are designed for online transaction
processing (OLTP). Such applications gather detailed data from day-to-day
operations.

Data Warehouse applications are designed to support the user's ad-hoc data
requirements, an activity dubbed online analytical processing (OLAP). These
include applications such as forecasting, profiling, summary reporting, and
trend analysis.

Production databases are updated continuously, either by hand or via OLTP
applications. In contrast, a warehouse database is updated from operational
systems periodically, usually during off-hours. As OLTP data accumulates in
production databases, it is regularly extracted, filtered, and then loaded
into a dedicated warehouse server that is accessible to users. As the
warehouse is populated, it must be restructured: tables are de-normalized,
data is cleansed of errors and redundancies, and new fields and keys are
added to reflect the user's needs for sorting, combining, and summarizing
data.

Data warehouses and their architectures vary depending upon the specifics of
an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic

Operational System

Operational system is a term used in data warehousing to refer to a system
that processes the day-to-day transactions of an organization.

Flat Files

A flat file system is a system of files in which transactional data is
stored, and every file in the system must have a different name.

Meta Data
Metadata is a set of data that defines and gives information about other
data.

Metadata is used in a Data Warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make
finding and working with particular instances of data easier. For example,
author, date created, date modified, and file size are examples of very
basic document metadata.

Metadata is used to direct a query to the most appropriate data source.
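
One way to picture metadata directing a query to a data source is a small
lookup over a catalog. The sketch below is purely illustrative: the CATALOG
entries, the source names, and the choose_source function are hypothetical
and do not describe any particular product's metadata repository.

```python
# Hypothetical metadata catalog: which store holds which subject,
# and at what level of detail (grain).
CATALOG = [
    {"source": "sales_monthly_summary", "subject": "sales", "grain": "month"},
    {"source": "sales_detail", "subject": "sales", "grain": "transaction"},
    {"source": "payroll_detail", "subject": "payroll", "grain": "transaction"},
]


def choose_source(subject, grain):
    """Return the cataloged source matching subject and grain, or None."""
    for entry in CATALOG:
        if entry["subject"] == subject and entry["grain"] == grain:
            return entry["source"]
    return None


# A monthly sales report is routed to the summary, not the detail table.
monthly_sales_source = choose_source("sales", "month")
print(monthly_sales_source)
```

The query tool never needs to know table names in advance; it asks the
catalog, which is the role the text assigns to metadata.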

Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance.
The summarized data is updated continuously as new information is loaded
into the warehouse.
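
The idea of keeping summarized data current during loading can be sketched as
follows. The Warehouse class, its attribute names, and the region/amount rows
are hypothetical; the sketch assumes a single summary (total sales per
region) that is refreshed as part of every load.

```python
from collections import defaultdict


class Warehouse:
    """Toy warehouse keeping detail rows plus a pre-aggregated summary."""

    def __init__(self):
        self.detail = []                           # detailed data
        self.sales_by_region = defaultdict(float)  # summarized data

    def load(self, rows):
        for region, amount in rows:
            self.detail.append((region, amount))
            # The summary is refreshed as new information is loaded.
            self.sales_by_region[region] += amount

    def total_for(self, region):
        # Answered from the summary, without scanning the detail rows.
        return self.sales_by_region.get(region, 0.0)


wh = Warehouse()
wh.load([("north", 10.0), ("south", 5.0)])
wh.load([("north", 2.5)])
north_total = wh.total_for("north")
print(north_total)
```

The query cost no longer grows with the number of detail rows, which is the
speed-up the summarized data is meant to provide.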

End-User access Tools

The principal purpose of a data warehouse is to provide information to
business managers for strategic decision-making. These users interact with
the warehouse using end-user access tools.

Examples of end-user access tools include:

o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools


Data Warehouse Architecture: With Staging Area


We must clean and process operational data before putting it into the
warehouse.

We can do this programmatically, although data warehouses typically use a
staging area (a place where data is processed before entering the
warehouse).

A staging area simplifies data cleansing and consolidation for operational
data coming from multiple source systems, especially for enterprise data
warehouses where all relevant data of an enterprise is consolidated.

The Data Warehouse Staging Area is a temporary location to which records
from source systems are copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups
within our organization.

We can do this by adding data marts. A data mart is a segment of a data
warehouse that can provide information for reporting and analysis on a
section, unit, department, or operation in the company, e.g., sales, payroll,
production, etc.

The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical
data for purchases and sales or mine historical information to make
predictions about customer behavior.
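
A rough way to picture data marts is as subject-specific slices of the
warehouse. The sketch below uses views in Python's built-in sqlite3 module;
this is a simplification chosen for brevity, since data marts are often
separate physical stores. The table name warehouse_facts, the view names,
and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Central warehouse table covering several subjects.
    CREATE TABLE warehouse_facts (subject TEXT, item TEXT, amount REAL);
    INSERT INTO warehouse_facts VALUES
        ('purchasing', 'paper', 40.0),
        ('sales', 'notebook', 90.0),
        ('sales', 'pen', 15.0),
        ('stocks', 'notebook', 300.0);
    -- Each data mart exposes only the subject its users care about.
    CREATE VIEW sales_mart AS
        SELECT item, amount FROM warehouse_facts WHERE subject = 'sales';
    CREATE VIEW purchasing_mart AS
        SELECT item, amount FROM warehouse_facts WHERE subject = 'purchasing';
    """
)

sales_total = conn.execute("SELECT SUM(amount) FROM sales_mart").fetchone()[0]
purchasing_rows = conn.execute(
    "SELECT item, amount FROM purchasing_mart"
).fetchall()
print(sales_total, purchasing_rows)
```

An analyst working with sales_mart sees only sales rows, while the warehouse
table remains the single source from which every mart is derived.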

Properties of Data Warehouse Architectures


The following architecture properties are necessary for a data warehouse
system:

1. Separation: Analytical and transactional processing should be kept apart
as much as possible.

2. Scalability: Hardware and software architectures should be simple to
upgrade as the data volume that has to be managed and processed, and the
number of user requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new
operations and technologies without redesigning the whole system.

4. Security: Monitoring access is necessary because of the strategic data
stored in the data warehouses.

5. Administerability: Data Warehouse management should not be complicated.

Types of Data Warehouse Architectures


Single-Tier Architecture
Single-Tier architecture is not frequently used in practice. Its purpose is
to minimize the amount of data stored; to reach this goal, it removes data
redundancies.

The figure shows that the only layer physically available is the source
layer. In this method, data warehouses are virtual. This means that the data
warehouse is implemented as a multidimensional view of operational data
created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the
requirement for separation between analytical and transactional processing.
Analysis queries are submitted to operational data after the middleware
interprets them. In this way, queries affect transactional workloads.
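
A virtual warehouse of this kind can be approximated with a database view,
as in the following sketch using Python's built-in sqlite3 module. A plain
SQL view stands in for the middleware layer here, which is an assumption
made for illustration; the orders table, the virtual_warehouse view, and the
sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Operational (source) table: the only data physically stored.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders (region, amount) VALUES ('east', 10.0), ('west', 20.0);
    -- The "warehouse" is only a view; no data is copied.
    CREATE VIEW virtual_warehouse AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
    """
)

query = "SELECT total FROM virtual_warehouse WHERE region = 'east'"
before = conn.execute(query).fetchone()[0]

# A new transaction arrives in the operational table...
conn.execute("INSERT INTO orders (region, amount) VALUES ('east', 5.0)")

# ...and the analytical result changes at once, because every analysis
# query is executed directly against the operational data.
after = conn.execute(query).fetchone()[0]
print(before, after)
```

Nothing is stored twice, but each analytical query runs on the same table
that serves transactions, which is the lack of separation noted above.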

Two-Tier Architecture
The requirement for separation plays an essential role in defining the
two-tier architecture for a data warehouse system, as shown in the figure.

Although it is typically called a two-layer architecture to highlight the
separation between physically available sources and data warehouses, it in
fact consists of four successive data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data.
That data is originally stored in corporate relational databases or legacy
databases, or it may come from an information system outside the corporate
walls.

2. Data staging: The data stored in the sources should be extracted,
cleansed to remove inconsistencies and fill gaps, and integrated to merge
heterogeneous sources into one standard schema. The so-called Extraction,
Transformation, and Loading (ETL) tools can combine heterogeneous schemata
and extract, transform, cleanse, validate, filter, and load source data into
a data warehouse.

3. Data warehouse layer: Information is stored in one logically centralized
single repository: a data warehouse. The data warehouse can be directly
accessed, but it can also be used as a source for creating data marts, which
partially replicate data warehouse contents and are designed for specific
enterprise departments. Metadata repositories store information on sources,
access procedures, data staging, users, data mart schemata, and so on.

4. Analysis: In this layer, integrated data is efficiently and flexibly
accessed to issue reports, dynamically analyze information, and simulate
hypothetical business scenarios. It should feature aggregate information
navigators, complex query optimizers, and user-friendly GUIs.
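
The four data flow stages can be sketched as four small Python functions.
Everything here is hypothetical and greatly simplified: the source names
crm_db and legacy_file, the differing field names, and the per-customer
total used as the "analysis" are assumptions made only to show how data
moves from one stage to the next.

```python
def extract(sources):
    """Source layer to staging: gather raw rows from every source."""
    return [row for rows in sources.values() for row in rows]


def transform(raw_rows):
    """Data staging: drop incomplete rows and merge into one standard schema."""
    staged = []
    for row in raw_rows:
        # The two hypothetical sources name the same attributes differently.
        customer = row.get("customer") or row.get("cust_name")
        amount = row.get("amount", row.get("amt"))
        if customer is None or amount is None:
            continue  # cleansing: skip rows with gaps that cannot be filled
        staged.append({"customer": customer.strip().title(), "amount": float(amount)})
    return staged


def load(warehouse, staged_rows):
    """Data warehouse layer: append to the centralized repository."""
    warehouse.extend(staged_rows)
    return warehouse


def analyze(warehouse):
    """Analysis layer: total amount per customer."""
    totals = {}
    for row in warehouse:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


sources = {
    "crm_db": [{"customer": "acme ", "amount": 100}],
    "legacy_file": [{"cust_name": "ACME", "amt": "50.5"}, {"cust_name": None, "amt": "9"}],
}
warehouse = load([], transform(extract(sources)))
report = analyze(warehouse)
print(report)
```

Rows from both sources end up under one customer name in one schema, and the
analysis step never touches the original sources, only the loaded repository.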

Three-Tier Architecture
The three-tier architecture consists of the source layer (containing
multiple source systems), the reconciled layer, and the data warehouse layer
(containing both data warehouses and data marts). The reconciled layer sits
between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard
reference data model for a whole enterprise. At the same time, it separates
the problems of source data extraction and integration from those of data
warehouse population. In some cases, the reconciled layer is also used
directly to better accomplish some operational tasks, such as producing daily
reports that cannot be satisfactorily prepared using the corporate
applications, or generating data flows that periodically feed external
processes so that they benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide
systems. A disadvantage of this structure is the extra storage space used by
the redundant reconciled layer. It also moves the analytical tools a little
further away from being real-time.
