Data Warehouse: Concepts, Architecture and Components
Definition:
A data warehouse is an information system that contains historical and cumulative data from one or
more sources. It simplifies the organization's reporting and analysis processes.
It also serves as a single version of truth for a company's decision making and forecasting.
Subject-Oriented
A data warehouse is subject-oriented: it offers information about a particular theme rather than about
a company's ongoing operations. These subjects can be sales, marketing, distribution, etc.
A data warehouse never focuses on ongoing operations. Instead, it puts emphasis on modeling and
analysis of data for decision making. It also provides a simple and concise view of a specific subject
by excluding data that is not helpful in supporting the decision process.
Integrated
In a data warehouse, integration means establishing a common unit of measure for all similar data
from dissimilar databases. The data also needs to be stored in the data warehouse in a common and
universally acceptable manner.
A data warehouse is developed by integrating data from varied sources such as mainframes, relational
databases, flat files, etc. Moreover, it must keep consistent naming conventions, formats, and coding.
This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures,
encoding structures, etc. has to be ensured. Consider the following example:
The example involves three different applications, labeled A, B and C. The information stored in these
applications is Gender, Date, and Balance. However, each application stores its data in a different way.
In Application A, the gender field stores logical values like M or F.
In Application B, the gender field is a numerical value.
In Application C, the gender field is stored in the form of a character value.
The same is the case with Date and Balance.
However, after the transformation and cleaning process, all this data is stored in a common format in the
data warehouse.
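The transformation step for the gender field could be sketched in Python roughly as follows. This is a
minimal illustration, not a prescribed method: only the "M or F" encoding of Application A comes from
the text, while the numeric codes for B, the character strings for C, and the names GENDER_MAPS and
to_warehouse_gender are assumptions made for the example.

```python
# Hypothetical per-application encodings of the gender field.
# Only "M or F" for application A comes from the text; B and C are assumed.
GENDER_MAPS = {
    "A": {"M": "M", "F": "F"},
    "B": {1: "M", 0: "F"},
    "C": {"male": "M", "female": "F"},
}


def to_warehouse_gender(app, raw_value):
    """Translate a source-specific gender value into the common warehouse code."""
    mapping = GENDER_MAPS[app]
    # Application C stores free text, so normalize case and whitespace first.
    key = raw_value.strip().lower() if app == "C" else raw_value
    if key not in mapping:
        raise ValueError(f"unknown gender value {raw_value!r} from application {app}")
    return mapping[key]


# One record from each source application, all mapped to the same format.
records = [("A", "M"), ("B", 0), ("C", "Female")]
unified = [to_warehouse_gender(app, value) for app, value in records]
print(unified)  # ['M', 'F', 'F']
```

Date and Balance would need similar per-source rules, for example a single date format and a single
currency unit.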
Time-Variant
The time horizon for a data warehouse is quite extensive compared with operational systems. The data
collected in a data warehouse is identified with a particular period and offers information from a
historical point of view. It contains an element of time, explicitly or implicitly.
One place where data warehouse data displays time variance is in the structure of the record key. Every
primary key contained in the data warehouse should have, either implicitly or explicitly, an element of
time, such as the day, week, month, etc.
Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
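As a rough sketch of a record key that carries an explicit element of time, the Python snippet below
pairs a subject identifier with a day. The names SalesKey, product_id, and sales_day, and the sample
figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen makes the key immutable and hashable
class SalesKey:
    product_id: int
    sales_day: date  # the explicit time element of the key


# The same product on two different days yields two distinct records,
# so a new fact is added next to the old one instead of replacing it.
warehouse = {}
warehouse[SalesKey(101, date(2024, 1, 1))] = 250.0
warehouse[SalesKey(101, date(2024, 1, 2))] = 310.0

# The history of product 101, ordered by day.
history = sorted(
    (key.sales_day, amount)
    for key, amount in warehouse.items()
    if key.product_id == 101
)
print(len(warehouse))  # 2
```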
Non-volatile
A data warehouse is also non-volatile, which means that previous data is not erased when new data is
entered. Data is read-only and periodically refreshed. This also helps to analyze historical data and
understand what happened and when. It does not require transaction processing, recovery, or concurrency
control mechanisms.
Activities like delete, update, and insert, which are performed in an operational application environment,
are omitted in the data warehouse environment. Only two types of data operations are performed in data
warehousing:
1. Data loading
2. Data access
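The two operations can be illustrated with a small in-memory store that exposes only a load method and
an access method. This is a sketch under simplifying assumptions: the class name NonVolatileStore and
its methods are hypothetical, and a real warehouse would enforce this behavior at the database level.

```python
class NonVolatileStore:
    """Append-only store: rows can be loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, batch):
        # Data loading: new rows are appended; existing rows are left untouched.
        self._rows.extend(dict(row) for row in batch)

    def access(self, predicate=lambda row: True):
        # Data access: return copies so callers cannot modify stored rows.
        return [dict(row) for row in self._rows if predicate(row)]


store = NonVolatileStore()
store.load([{"day": "2024-01-01", "sales": 250.0}])
store.load([{"day": "2024-01-02", "sales": 310.0}])  # a periodic refresh

snapshot = store.access()
snapshot[0]["sales"] = 0.0  # changing the returned copy does not affect the store
print(store.access()[0]["sales"])  # 250.0
```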
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored. The goal is to remove data
redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-layer architecture physically separates the available sources from the data warehouse. This
architecture is not expandable and does not support a large number of end users. It also has connectivity
problems because of network limitations.
Three-tier architecture
This is the most widely used architecture.
It consists of the Top, Middle and Bottom Tier.
1. Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational
database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
2. Middle Tier: The middle tier in a data warehouse is an OLAP server, implemented using either the
ROLAP or the MOLAP model. For a user, this application tier presents an abstracted view of the
database. This layer also acts as a mediator between the end user and the database.
3. Top Tier: The top tier is a front-end client layer. It consists of the tools and APIs that you use to
connect to the data warehouse and get data out of it. These could be query tools, reporting tools,
managed query tools, analysis tools, and data mining tools.
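The separation of the three tiers can be mimicked in a few lines of Python with the standard sqlite3
module. This is only a toy sketch: an in-memory SQLite database stands in for the bottom tier, a plain
aggregation function stands in for the OLAP server (it is not a real ROLAP or MOLAP engine), and a
text formatter stands in for a reporting tool. The table, column, and function names are assumptions.

```python
import sqlite3

# Bottom tier: a relational database holding the loaded data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sales_day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "2024-01-01", 100.0),
        ("north", "2024-01-02", 150.0),
        ("south", "2024-01-01", 80.0),
    ],
)


# Middle tier: mediates between client and database and exposes an
# abstracted, aggregated view instead of raw tables.
def total_by_region(connection):
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


# Top tier: a front-end tool that only talks to the middle tier.
def render_report(totals):
    return "\n".join(f"{region}: {amount:.2f}" for region, amount in totals.items())


report = render_report(total_by_region(conn))
print(report)
```

The point of the layering is that the reporting code never issues SQL itself, so the storage layer can
change without touching the client tools.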
Data Warehouse Components
The data is selected from various sources, then integrated and stored in a single, particular
format.
A data warehouse contains current detailed data, historical detailed data, lightly and highly summarized
data, and metadata.
Current and historical data: these are voluminous because they are stored at the highest level of detail.
Lightly and highly summarized data: these are necessary to save processing time when users request them,
and they are readily accessible.
Metadata: this is “data about data”. It is important for designing, constructing, retrieving, and
controlling the warehouse data.
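The relationship between detailed and summarized data can be sketched as follows. The choice of daily
totals for "lightly summarized" and monthly totals for "highly summarized" is an assumption made for
illustration, as are the sample rows and the summarize helper.

```python
from collections import defaultdict

# Detailed data: one row per transaction, the highest level of detail.
detailed = [
    ("2024-01-01", 100.0),
    ("2024-01-01", 50.0),
    ("2024-01-02", 80.0),
    ("2024-02-01", 40.0),
]


def summarize(rows, key_length):
    """Total the amounts, grouping by the first key_length characters of the ISO date."""
    totals = defaultdict(float)
    for day, amount in rows:
        totals[day[:key_length]] += amount
    return dict(totals)


lightly = summarize(detailed, 10)  # "YYYY-MM-DD": daily totals
highly = summarize(detailed, 7)    # "YYYY-MM": monthly totals

print(lightly)  # {'2024-01-01': 150.0, '2024-01-02': 80.0, '2024-02-01': 40.0}
print(highly)   # {'2024-01': 230.0, '2024-02': 40.0}
```

Precomputing such totals trades extra storage for faster answers to common requests.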