Data Warehousing
Data warehousing is the process of collecting, storing, and managing large volumes of data from
different sources into a centralized system, known as a data warehouse.
A data warehouse is a store of a large amount of operational data (data that documents the everyday
operations of an organisation) gathered from multiple sources and stored under a unified schema at a
single site.
Subject-Oriented: An operational database is used to represent a process, such as payroll or
accounting. A data warehouse, on the other hand, is used to analyze a particular subject area, for
example “sales”. “Sales” may in turn have several dimensions.
Integrated: The original data held in different source systems is not integrated. A data warehouse
integrates data from these multiple sources. For example, a customer may be identified by two
different keys in two different source systems. The data warehouse must be able to integrate the two
source systems and identify each customer on the basis of a single key.
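The key-integration idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems (a CRM and a billing system), their field names, and the choice of a normalized e-mail address as the matching attribute are all hypothetical and do not come from the notes above.

```python
# Hypothetical records from two source systems that identify the same
# customers with different keys (names and values are illustrative only).
crm_customers = [
    {"crm_id": "C-101", "email": "ana@example.com", "name": "Ana Diaz"},
    {"crm_id": "C-102", "email": "raj@example.com", "name": "Raj Patel"},
]
billing_customers = [
    {"account_no": 9001, "email": "ANA@example.com", "balance": 120.0},
    {"account_no": 9002, "email": "li@example.com", "balance": 35.5},
]


def integrate_customers(crm_rows, billing_rows):
    """Merge both sources under one surrogate warehouse key.

    Assumption: a normalized e-mail address is a reliable match attribute.
    """
    unified = {}  # normalized email -> integrated record
    for row in crm_rows:
        match = row["email"].strip().lower()
        record = unified.setdefault(match, {"email": match})
        record["crm_id"] = row["crm_id"]
        record["name"] = row["name"]
    for row in billing_rows:
        match = row["email"].strip().lower()
        record = unified.setdefault(match, {"email": match})
        record["account_no"] = row["account_no"]
    # Assign a single surrogate key per customer, in a stable (sorted) order.
    for key, match in enumerate(sorted(unified), start=1):
        unified[match]["customer_key"] = key
    return list(unified.values())


warehouse_customers = integrate_customers(crm_customers, billing_customers)
```

After integration, the customer known as `C-101` in one system and `9001` in the other appears once in the warehouse, under a single `customer_key`, while still carrying both source keys for traceability.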
Time-Variant: Operational data represents only the current data, whereas a data warehouse keeps all
the historical data as well. You can retrieve data for the last 3 months, 6 months, 12 months, or even
older data from a data warehouse.
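A minimal sketch of such a time-window query, assuming each warehouse row carries the date it describes. The `sales_history` rows are made up for illustration, and "3 months" is approximated as 90 days to keep the example short.

```python
from datetime import date, timedelta

# Hypothetical warehouse rows: every row keeps the date it describes.
sales_history = [
    {"sale_date": date(2023, 1, 15), "amount": 200.0},
    {"sale_date": date(2023, 9, 10), "amount": 150.0},
    {"sale_date": date(2023, 11, 20), "amount": 300.0},
    {"sale_date": date(2023, 12, 5), "amount": 50.0},
]


def sales_in_last_days(rows, as_of, days):
    """Return rows whose sale_date falls in the `days` days up to `as_of`."""
    start = as_of - timedelta(days=days)
    return [r for r in rows if start < r["sale_date"] <= as_of]


as_of = date(2023, 12, 31)
last_3_months = sales_in_last_days(sales_history, as_of, 90)    # approx. 3 months
last_12_months = sales_in_last_days(sales_history, as_of, 365)  # approx. 12 months
```

Because older rows are kept rather than overwritten, widening the window only requires changing the `days` argument.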
Non-volatile: The only way to add data to a data warehouse is to extract it from source systems. The
data is used only for analysis, and no changes are made to it. Historical data in a data warehouse
is never altered or deleted.
Components of Data Warehousing:
Source data coming into the data warehouse may be grouped into four broad categories:
Production Data: This is the data that comes from the various operational systems of an enterprise
(all the day-to-day operational data).
Internal Data: This includes "private" spreadsheets, reports, customer profiles, and sometimes
even departmental databases.
Archived Data: This is old or historical data. In every operational system, old data is
periodically taken out and stored in archived files.
External Data: This includes statistical data about the organisation's industry produced by
external agencies.
After we have extracted data from various operational systems and external sources, we have
to prepare it for storage in the data warehouse. The extracted data, coming from several
different sources, needs to be changed, converted, and made ready in a format that is suitable
for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
Data Extraction: Data is extracted from various sources.
Data Transformation:
First, we clean the data extracted from each source. Cleaning may involve correcting
misspellings, providing default values for missing data elements, or eliminating
duplicates when the same data is brought in from several source systems.
Then data standardization is performed. In this step, data from many source records is
combined into a single, consistent form.
Data Loading: Finally, the prepared data is loaded into the data warehouse storage.
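The three staging functions can be sketched as a small pipeline. This is an illustrative sketch, not a real ETL tool: the rows, the `SPELLING_FIXES` correction table, the `DEFAULTS` values, and the rule for deciding that two rows describe the same customer are all assumptions made for the example.

```python
# Hypothetical extracted rows from two sources; field names are illustrative.
extracted = [
    {"customer": "Ana Diaz", "city": "Nwe York", "country": None},
    {"customer": "Ana Diaz", "city": "New York", "country": "US"},
    {"customer": "Raj Patel", "city": "Mumbai", "country": "IN"},
]

SPELLING_FIXES = {"Nwe York": "New York"}  # assumed correction table
DEFAULTS = {"country": "UNKNOWN"}          # assumed default values


def clean(rows):
    """Correct known misspellings and fill missing values with defaults."""
    cleaned = []
    for row in rows:
        fixed = dict(row)
        fixed["city"] = SPELLING_FIXES.get(fixed["city"], fixed["city"])
        for field, default in DEFAULTS.items():
            if fixed.get(field) is None:
                fixed[field] = default
        cleaned.append(fixed)
    return cleaned


def standardize(rows):
    """Combine rows that describe the same customer into one record."""
    combined = {}
    for row in rows:
        key = (row["customer"], row["city"])
        current = combined.setdefault(key, dict(row))
        # Prefer a real value over a default placeholder from another source.
        for field, default in DEFAULTS.items():
            if current[field] == default and row[field] != default:
                current[field] = row[field]
    return list(combined.values())


def load(rows, warehouse):
    """Append the prepared rows to the warehouse storage (a list here)."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(standardize(clean(extracted)), [])
```

Note that duplicate elimination only works here because cleaning runs first: the two "Ana Diaz" rows match only after the misspelled city has been corrected.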
Data storage for the data warehouse is a separate repository. Data is stored on three levels:
Metadata
Metadata is data that describes other data. In a data warehouse, it provides
information about the data's origin, structure, format, and how it is used.
Data Marts
A data mart is a subset of the data warehouse that is focused on a specific business
area or department, such as sales, marketing, or finance.
Multidimensional Database
For analysis purposes, data is stored in multidimensional databases.
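The three storage levels can be illustrated with plain Python data structures. This is a toy sketch under stated assumptions: the fact rows, the `metadata` description, and the helper names `build_data_mart` and `build_cube` are hypothetical, and a real multidimensional database would do far more than sum a measure per cell.

```python
from collections import defaultdict

# Hypothetical warehouse fact rows; departments, regions, and amounts are
# illustrative only.
facts = [
    {"department": "sales", "region": "North", "quarter": "Q1", "amount": 100},
    {"department": "sales", "region": "North", "quarter": "Q2", "amount": 150},
    {"department": "sales", "region": "South", "quarter": "Q1", "amount": 80},
    {"department": "finance", "region": "North", "quarter": "Q1", "amount": 40},
]

# Metadata: data that describes the origin and structure of the rows above.
metadata = {
    "origin": "hypothetical order-entry system",
    "fields": {"department": "str", "region": "str",
               "quarter": "str", "amount": "int"},
}


def build_data_mart(rows, department):
    """A data mart here is simply the subset of rows for one department."""
    return [r for r in rows if r["department"] == department]


def build_cube(rows, dimensions, measure):
    """Sum a measure for every combination of the given dimensions."""
    cube = defaultdict(int)
    for row in rows:
        cell = tuple(row[d] for d in dimensions)
        cube[cell] += row[measure]
    return dict(cube)


sales_mart = build_data_mart(facts, "sales")
sales_cube = build_cube(sales_mart, ("region", "quarter"), "amount")
by_region = build_cube(sales_mart, ("region",), "amount")
```

Dropping a dimension from the `dimensions` tuple, as in `by_region`, rolls the same data up to a coarser level of detail.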
The information delivery element enables data to be fetched from the data warehouse and
transferred to one or more destinations.
The management and control elements coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the data
warehouse storage. They also control the delivery of data to clients. They work with the
database management system to ensure that data is correctly saved in the repositories, and they
monitor the movement of data into the staging area and from there into the data warehouse
storage itself.
Complexity increases with size
1. Improved Decision-Making
o Data warehouses store historical data, allowing businesses to track trends over time,
compare past and present performance, and perform long-term data analysis.
o Through data cleansing and transformation processes, data warehouses improve the
quality and consistency of data, reducing errors and discrepancies between different
systems.
5. Data Integration
o Data warehouses consolidate data from multiple sources (e.g., ERP, CRM, social
media), offering a comprehensive view of an organization’s operations, improving
cross-functional analysis.
o Integrating data from multiple sources, especially if they are in different formats
(structured, unstructured), can be time-consuming and complex, requiring advanced
data transformation efforts.
o Data warehouses need regular maintenance, updates, and scaling as data volumes
grow, leading to ongoing operational costs in terms of both resources and
personnel.
4. Data Latency