What Is A Data Warehouse
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. Data cleaning and data integration are performed during warehousing to ensure consistency in naming conventions, attribute types, and so on across the different data sources.
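As a rough illustration of this integration step, here is a minimal Python sketch that harmonizes column names and attribute types coming from two hypothetical source systems; the source names ("crm", "sales"), the column mappings, and the date format are all assumptions made for illustration, not part of any particular product.

from datetime import datetime

# Hypothetical column-name mappings for two source systems (illustrative only).
COLUMN_MAP = {
    "crm":   {"cust_id": "customer_id", "dob": "birth_date"},
    "sales": {"CUSTOMER": "customer_id", "BIRTHDATE": "birth_date"},
}

def harmonize(record, source):
    """Rename columns and coerce attribute types to one warehouse convention."""
    mapping = COLUMN_MAP[source]
    out = {mapping.get(key, key): value for key, value in record.items()}
    # Both sources store the birth date as text; convert it to a date object.
    if isinstance(out.get("birth_date"), str):
        out["birth_date"] = datetime.strptime(out["birth_date"], "%Y-%m-%d").date()
    return out

print(harmonize({"cust_id": 7, "dob": "1990-05-01"}, "crm"))
print(harmonize({"CUSTOMER": 7, "BIRTHDATE": "1990-05-01"}, "sales"))

Both records end up with the same field names and types, which is exactly the consistency the warehouse needs.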
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even further back from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed on it. Data access usually requires only two procedures: the initial loading of data and read access to data. Therefore, the data warehouse does not require transaction processing, recovery, or concurrency-control capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, data should not change.
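A minimal sketch of those two procedures, assuming an in-memory SQLite database and an invented sales_history table: the warehouse-side code exposes only a bulk load and read-only queries, with no update or delete paths.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_history (sale_date TEXT, amount REAL)")

def initial_load(rows):
    """Bulk-load historical rows; existing rows are never updated or deleted."""
    con.executemany("INSERT INTO sales_history VALUES (?, ?)", rows)
    con.commit()

def analytical_query(sql, params=()):
    """Read-only access: analytical queries retrieve data, they never modify it."""
    return con.execute(sql, params).fetchall()

initial_load([("2023-01-15", 120.0), ("2023-02-15", 95.5)])
print(analytical_query("SELECT COUNT(*), SUM(amount) FROM sales_history"))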
History of Data Warehouse
The idea of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "Business Data Warehouse."
Production Data: This type of data comes from the different operational systems of the enterprise. Based on the data requirements of the data warehouse, we choose segments of the data from the various operational systems.
We will now discuss the three primary functions that take place in the
staging area.
1) Data Extraction: This method has to deal with numerous data sources.
We have to employ the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even greater ones. We perform several individual tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may involve correcting misspellings, providing default values for missing data elements, or eliminating duplicates when we bring in the same data from several source systems.
Data transformation also includes purging source data that is not useful and separating source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized (a small sketch of these steps appears after this list).
3) Data Loading: Once the data is transformed, it is moved into the data warehouse, first as an initial bulk load and thereafter as periodic incremental loads.
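The following minimal Python sketch walks through the transformation tasks just described: correcting known misspellings, supplying default values for missing elements, dropping duplicates, and sorting the result. The records, the spelling table, and the rules are invented for illustration.

RAW_RECORDS = [
    {"customer_id": 1, "city": "Nwe York", "segment": None},
    {"customer_id": 1, "city": "New York", "segment": "retail"},    # duplicate key
    {"customer_id": 2, "city": "Boston", "segment": "wholesale"},
]

SPELLING_FIXES = {"Nwe York": "New York"}  # illustrative correction table

def transform(records):
    cleaned, seen = [], set()
    for record in records:
        record = dict(record)
        record["city"] = SPELLING_FIXES.get(record["city"], record["city"])  # fix misspellings
        record["segment"] = record["segment"] or "unknown"                    # default for missing value
        if record["customer_id"] in seen:                                     # eliminate duplicates
            continue
        seen.add(record["customer_id"])
        cleaned.append(record)
    return sorted(cleaned, key=lambda r: r["customer_id"])                    # sort before merging

print(transform(RAW_RECORDS))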
Metadata Component
Metadata in a data warehouse is analogous to the data dictionary or data catalog in a database management system. In the data dictionary, we keep data about the logical data structures, data about records and addresses, information about the indexes, and so on.
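As a rough sketch of what such a metadata entry might hold, the dictionary below records the logical structure, physical location, indexes, and refresh schedule of one hypothetical warehouse table; the layout is invented for illustration and is not a standard format.

# Illustrative metadata catalog entry, analogous to a DBMS data dictionary.
catalog = {
    "sales_fact": {
        "columns": {"sale_date": "DATE", "product_id": "INTEGER", "amount": "DECIMAL(10,2)"},
        "source": "orders table in the operational RDBMS",
        "location": "/warehouse/facts/sales_fact",
        "indexes": ["idx_sales_date", "idx_sales_product"],
        "refresh": "nightly batch load",
    }
}

def describe(table_name):
    """Print a short summary of a table from the metadata catalog."""
    meta = catalog[table_name]
    print(f"{table_name}: {len(meta['columns'])} columns, indexes={meta['indexes']}, refresh={meta['refresh']}")

describe("sales_fact")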
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, though not necessarily up to the minute, and developments in the data warehouse industry have made standard and incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or subject area of the organization. The current trend in data warehousing is to develop a data warehouse with several smaller, related data marts for particular kinds of queries and reports.
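A minimal sketch of carving a data mart out of a warehouse, assuming an in-memory SQLite database and invented table names: the mart keeps only the rows and aggregates relevant to one group of users (here, an EMEA sales team).

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)", [
    ("EMEA", "laptop", 1200.0), ("APAC", "phone", 600.0), ("EMEA", "phone", 550.0),
])

# The data mart is a subject-specific subset of the corporate-wide data.
con.execute("""CREATE TABLE emea_sales_mart AS
               SELECT product, SUM(amount) AS total_amount
               FROM warehouse_sales
               WHERE region = 'EMEA'
               GROUP BY product""")
print(con.execute("SELECT * FROM emea_sales_mart ORDER BY product").fetchall())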
A data warehouse is used for analysis and decision making, which requires an extensive database, including historical data, that an operational database does not typically maintain.
Database: The tables and joins are complex since they are normalized. This is done to reduce redundant data and to save storage space.
Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.
Database: Entity-relationship modeling techniques are used for database design.
Data Warehouse: Data-modeling techniques are used for data warehouse design.
Database: A database is the place where data is stored and managed to provide fast and efficient access.
Data Warehouse: A data warehouse is the place where application data is managed for analysis and reporting purposes.
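The de-normalization contrast above can be made concrete with a small sketch, again assuming an in-memory SQLite database and invented tables: the operational side keeps customers and orders normalized in separate tables, while the warehouse pre-joins them into one wide table so analytical queries stay simple.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Normalized operational schema (RDBMS side).
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Boston'), (2, 'Austin');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 45.0), (12, 1, 20.0);

    -- De-normalized warehouse table: the join is done once, up front.
    CREATE TABLE order_wide AS
    SELECT o.order_id, o.amount, c.customer_id, c.city
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id;
""")
# An analytical query against the wide table needs no joins at all.
print(con.execute("SELECT city, SUM(amount) FROM order_wide GROUP BY city").fetchall())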
Data Warehouse and the OLTP database are both relational databases.
However, the goals of both these databases are different.
Operational systems are designed to support high-volume transaction processing; data warehousing systems are typically designed to support high-volume analytical processing (OLAP).
Operational systems are usually concerned with current data; data warehousing systems are usually concerned with historical data.
Data within operational systems is updated regularly according to need; data in a warehouse is non-volatile, so new data may be added regularly but, once added, is rarely changed.
Operational systems are optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table; data warehousing systems are optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Operational systems are largely process-oriented; data warehousing systems are largely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data; data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.
Relational databases are created for online transaction processing (OLTP); data warehouses are designed for online analytical processing (OLAP).
OLAP System
OLAP deals with historical or archival data, that is, data accumulated over a long period. For example, if we collect the last 10 years of information about flight reservations, the data can yield meaningful insights such as trends in reservations. This may provide useful information such as the peak times for travel and what kinds of people travel in the various classes (Economy/Business).
The major difference between an OLTP and an OLAP system is the amount of data analyzed in a single transaction. Whereas an OLTP system handles many concurrent users and queries that touch only an individual record or limited groups of records at a time, an OLAP system must have the capability to operate on millions of records to answer a single query.
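A minimal sketch of the kind of summarization such an OLAP query performs, assuming an in-memory SQLite database and an invented reservations table: one query scans every row to report bookings and average fares by year and travel class, which is where trends such as peak travel periods show up.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reservations (booked_on TEXT, travel_class TEXT, fare REAL)")
con.executemany("INSERT INTO reservations VALUES (?, ?, ?)", [
    ("2014-07-01", "Economy", 320.0), ("2014-12-20", "Business", 900.0),
    ("2023-07-15", "Economy", 410.0), ("2023-07-16", "Economy", 395.0),
])

# One analytical query touches every row, unlike an OLTP lookup of a single record.
rows = con.execute("""
    SELECT strftime('%Y', booked_on) AS year, travel_class,
           COUNT(*) AS bookings, AVG(fare) AS avg_fare
    FROM reservations
    GROUP BY year, travel_class
    ORDER BY year, travel_class
""").fetchall()
for row in rows:
    print(row)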
Data contents: An OLTP system manages current data that is typically too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity, which makes the data easier to use for informed decision making.
Database design: An OLTP system usually uses an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically uses either a star or a snowflake model and a subject-oriented database design.
View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization; OLAP systems also deal with data that originates from different organizations, integrating information from many data stores.
Volume of data: OLTP data volumes are not very large. Because of their large volume, OLAP data are stored on multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions, so such a system requires concurrency control and recovery techniques. Accesses to OLAP systems are mostly read-only operations, because these data warehouses store historical data.
Inserts and updates: An OLTP system handles short, fast inserts and updates initiated by end users. In an OLAP system, periodic long-running batch jobs refresh the data.
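The refresh pattern from the last row above can be sketched as follows, assuming an in-memory SQLite warehouse table with invented names: instead of many small end-user inserts, a periodic batch job appends the newly extracted rows in a single pass.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (sale_date TEXT, amount REAL)")

def nightly_refresh(extracted_rows):
    """Long-running batch job: append the day's extracted rows in one transaction."""
    with con:  # the connection context manager commits the whole batch at once
        con.executemany("INSERT INTO sales_fact VALUES (?, ?)", extracted_rows)

nightly_refresh([("2024-03-01", 120.0), ("2024-03-01", 75.0)])
print(con.execute("SELECT COUNT(*), SUM(amount) FROM sales_fact").fetchone())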
Data warehouse applications are designed to support users' ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
(Figure: data warehouse source systems, including operational systems and flat files.)
Meta Data
Metadata is a set of data that defines and gives information about other data. Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, the author, the date created, the date modified, and the file size are examples of very basic document metadata.
The summary information area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a financial analyst wants to analyze historical data for purchases and sales, or to mine historical information to make predictions about customer behavior.
Single-Tier Architecture
The figure shows that the only layer physically available is the source layer. In this approach, the data warehouse is virtual: it is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analytical queries are issued against the operational data after the middleware interprets them, so these queries affect the transactional workload.
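The virtual-warehouse idea can be sketched very roughly, assuming an in-memory SQLite database standing in for the operational system: the "warehouse" is nothing but a view defined over the operational table, so every analytical query runs against the operational data and competes with the transactional workload.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES (1, '2024-01-10', 50.0), (2, '2024-01-11', 80.0);

    -- No separate warehouse storage: the 'warehouse' is only this view.
    CREATE VIEW sales_by_day AS
    SELECT order_date, SUM(amount) AS total
    FROM orders
    GROUP BY order_date;
""")
# This analytical query is interpreted into a query on the operational table.
print(con.execute("SELECT * FROM sales_by_day").fetchall())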
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system, as shown in the figure.
Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages: the source layer, the data staging layer, the data warehouse layer, and the analysis layer.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.