Data Warehouse Unit1 CS3551
Subject-Oriented
A data warehouse targets the modelling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, rather than the organization's ongoing operations as a whole. This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. It requires data cleaning and data integration during warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older periods from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed there. Data access usually requires only two procedures: initial loading of data and access to data. Therefore, the data warehouse does not require transaction processing, recovery, or concurrency control capabilities, which allows for substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, data should not change.
Need for Data Warehouse
1. Business User: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This input is made available for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing data from different sources to a common place, the user can effectively bring uniformity and consistency to the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is easier for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.
Components or Building Blocks of Data Warehouse
Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks in the most effective way; we may also want to boost up one part with extra tools and services. All of this depends on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle, we see the Data Storage component that handles the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.
Source Data Component
Source data coming into the data warehouse may be grouped into four broad categories:
Production Data: This type of data comes from the different operating systems of the enterprise. Based on the data requirements in the data warehouse, we choose segments of the data from the various operational systems.
Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part of
which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external departments and agencies.
Data Staging Component
After we have extracted data from the various operational systems and external sources, we have to prepare the files for storing in the data warehouse. The extracted data coming from several different sources needs to be changed, converted, and made ready in a format that is suitable to be saved for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even more significant challenges. We perform several individual tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data elements forms a large part of data transformation. Data transformation also involves many forms of combining pieces of data from different sources. We combine data from a single source record or related data parts from many source records. On the other hand, data transformation also includes purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.
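As a minimal sketch of these cleaning and standardization tasks (assuming Python with pandas; the source extracts and column names such as cust_id and country are purely illustrative), the transformation step might look like this:

```python
import pandas as pd

# Hypothetical extracts from two source systems (column names are illustrative).
crm = pd.DataFrame({"cust_id": [1, 2, 2], "cust_name": ["Alice", "Bob", "Bob"],
                    "country": ["USA", "U.S.A.", None]})
billing = pd.DataFrame({"cust_id": [3], "cust_name": ["Carol"], "country": ["India"]})

# Combine records coming from the different sources.
staged = pd.concat([crm, billing], ignore_index=True)

# Cleaning: standardize values, supply defaults for missing data, drop duplicates.
staged["country"] = staged["country"].replace({"U.S.A.": "USA"}).fillna("UNKNOWN")
staged = staged.drop_duplicates(subset=["cust_id"])

print(staged)
```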
3) Data Loading: Two distinct categories of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data and uses up a substantial amount of time. After that, incremental loads feed the warehouse with ongoing changes on a periodic basis.
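A hedged sketch of the two loading categories, using Python with pandas and SQLite as a stand-in for the warehouse storage (table and column names are hypothetical):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")  # stands in for the warehouse storage

# Initial load: move the full staged history into the warehouse in one bulk pass.
history = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [120.0, 75.5, 310.0]})
history.to_sql("sales_fact", conn, if_exists="replace", index=False)

# Incremental load: periodically append only the new records extracted since the last run.
new_rows = pd.DataFrame({"cust_id": [4], "amount": [42.0]})
new_rows.to_sql("sales_fact", conn, if_exists="append", index=False)

print(pd.read_sql("SELECT COUNT(*) AS row_count FROM sales_fact", conn))
```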
Data Storage Components
Data storage for data warehousing is a separate repository. The data repositories for the operational systems generally include only the current data. Also, these data repositories hold the data in a highly normalized structure for fast and efficient processing.
Information Delivery Component
The information delivery element is used to enable the process of subscribing to data warehouse files and having them transferred to one or more destinations according to some customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep data about the logical data structures, data about the records and addresses, information about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data warehouse industry have made frequent and incremental data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single department or group within the organization. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.
Management and Control Component
The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. On the other hand, they moderate the data delivery to the clients. They work with the database management systems and ensure that data is correctly saved in the repositories. They monitor the movement of data into the staging area and from there into the data warehouse storage itself.
Why do we need a separate Data Warehouse?
Data warehouse queries are complex because they involve the computation of large groups of data at summarized levels.
They may require the use of distinctive data organization, access, and implementation methods based on multidimensional views.
Performing OLAP queries in an operational database degrades the performance of operational tasks.
A data warehouse is used for analysis and decision making, which requires an extensive database, including historical data, which an operational database does not typically maintain.
The separation of an operational database from data warehouses is based on the different
structures and uses of data in these systems.
Because the two systems provide different functionalities and require different kinds of data,
it is necessary to maintain separate databases.
Difference between Database and Data Warehouse
1. Database: It is used for Online Transactional Processing (OLTP) but can be used for other objectives such as data warehousing. It records the data from the clients for history.
   Data Warehouse: It is used for Online Analytical Processing (OLAP). It reads the historical information of the customers for business decisions.
2. Database: The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space.
   Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.
4. Database: Entity-relationship modeling procedures are used for RDBMS database design.
   Data Warehouse: Data modeling approaches are used for data warehouse design.
6. Database: Performance is low for analysis queries.
   Data Warehouse: Performance is high for analytical queries.
7. Database: The database is the place where the data is taken as a base and managed to provide fast and efficient access.
   Data Warehouse: The data warehouse is the place where the application data is handled for analysis and reporting purposes.
Difference between Operational Database and Data Warehouse
The Operational Database is the source of information for the data warehouse. It includes
detailed information used to run the day to day operations of the business. The data
frequently changes as updates are made and reflect the current value of the last transactions.
Operational database management systems, also called OLTP (Online Transaction Processing) databases, are used to manage dynamic data in real-time.
Data warehouse systems serve users or knowledge workers for the purpose of data analysis and decision-making. Such systems can organize and present information in specific formats to accommodate the diverse needs of various users. These systems are called Online Analytical Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the goals of
both these databases are different.
Operational Database: Operational systems are designed to support high-volume transaction processing.
Data Warehouse: Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Operational Database: Operational systems are usually concerned with current data.
Data Warehouse: Data warehousing systems are usually concerned with historical data.
Operational Database: Data within operational systems are mainly updated regularly according to need.
Data Warehouse: Data are non-volatile; new data may be added regularly, but once added they are rarely changed.
Operational Database: It is designed for real-time business dealings and processes.
Data Warehouse: It is designed for analysis of business measures by subject area, categories, and attributes.
Operational Database: It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table.
Data Warehouse: It is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Operational Database: Operational systems are widely process-oriented.
Data Warehouse: Data warehousing systems are widely subject-oriented.
Operational Database: Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data.
Data Warehouse: Data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.
Operational Database: Relational databases are created for Online Transaction Processing (OLTP).
Data Warehouse: Data warehouses are designed for Online Analytical Processing (OLAP).
OLTP System
An OLTP system deals with operational data. Operational data are those data involved in the operation of a particular system, for example, ATM transactions, bank transactions, etc.
OLAP System
OLAP deals with historical data or archival data. Historical data are those data that are archived over a long period. For example, if we collect the last 10 years' information about flight reservations, the data can give us much meaningful information, such as trends in reservations. This may provide useful information like the peak time of travel and what kind of people are traveling in various classes (Economy/Business), etc.
The major difference between an OLTP and an OLAP system is the amount of data analyzed in a single transaction. An OLTP system manages many concurrent users and queries touching only an individual record or limited groups of records at a time, whereas an OLAP system must have the capability to operate on millions of records to answer a single query.
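The contrast can be illustrated with a small Python/SQLite sketch (the reservations table and its columns are hypothetical): an OLTP-style operation touches one record in a short transaction, while an OLAP-style query scans and summarizes many records at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (year INT, cls TEXT, fare REAL)")

# OLTP-style work: a short transaction that touches an individual record.
conn.execute("INSERT INTO reservations VALUES (?, ?, ?)", (2024, "Economy", 450.0))
conn.commit()

# Bulk-load a few more rows so the analytical query has something to summarize.
rows = [(y, c, 400.0 + y % 7) for y in range(2015, 2025) for c in ("Economy", "Business")]
conn.executemany("INSERT INTO reservations VALUES (?, ?, ?)", rows)
conn.commit()

# OLAP-style work: a single query that scans and summarizes many rows.
for year, cls, bookings in conn.execute(
        "SELECT year, cls, COUNT(*) FROM reservations GROUP BY year, cls"):
    print(year, cls, bookings)
```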
Data contents: An OLTP system manages current data that are typically too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity. This makes the data easier to use for informed decision making.
Database design: An OLTP system usually uses an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically uses either a star or snowflake model and a subject-oriented database design.
View: An OLTP system focuses primarily on the current data within an enterprise or department, without referring to historical data or data in different organizations. An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with data that originates from various organizations, integrating information from many data stores.
Volume of data: OLTP data volumes are not very large. OLAP data, because of their large volume, are stored on multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery techniques. Accesses to OLAP systems are mostly read-only operations because these data warehouses store historical data.
Inserts and updates: In an OLTP system, short and fast inserts and updates are initiated by end users. In an OLAP system, periodic long-running batch jobs refresh the data.
Operational System
An operational system is a term used in data warehousing to refer to a system that is used to process the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized data is updated continuously as new information is loaded into the warehouse.
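As an illustrative sketch (Python with pandas; the sales data and column names are hypothetical), lightly and highly summarized aggregates can be precomputed from the detailed records like this:

```python
import pandas as pd

# Hypothetical detailed fact data as it might sit in warehouse storage.
detail = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "month":  ["2024-01", "2024-02", "2024-01", "2024-02"],
    "sales":  [100.0, 150.0, 80.0, 120.0],
})

# Lightly summarized: aggregate to one level (sales per region per month).
lightly = detail.groupby(["region", "month"], as_index=False)["sales"].sum()

# Highly summarized: aggregate further (total sales per region).
highly = lightly.groupby("region", as_index=False)["sales"].sum()

print(lightly)
print(highly)
```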
End-User access Tools
The principal purpose of a data warehouse is to provide information to the business managers for
strategic decision-making. These customers interact with the warehouse using end-client access tools.
The examples of some of the end-user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process our operational data before putting it into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated.
Data Warehouse Staging Area is a temporary location where a record from source systems is copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our organization. We can do this by adding data marts. A data mart is a segment of a data warehouse that can provide information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a
financial analyst wants to analyze historical data for purchases and sales or mine historical information
to make predictions about customer behavior.
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume that has to be managed and processed, and the number of users' requirements that have to be met, progressively increase.
3. Extensibility: The architecture should be able to host new applications and technologies without redesigning the whole system.
4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.
5. Administerability: Data warehouse management should not be complicated.
Types of Data Warehouse Architectures
Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The figure shows that the only layer physically available is the source layer. In this approach, data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are submitted against operational data after the middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data
warehouse system, as shown in fig:
A middle tier, which consists of an OLAP server for fast querying of the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations (a small sketch of this mapping follows below).
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.
A top tier that contains front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data.
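A minimal sketch of the ROLAP idea, assuming Python with SQLite and an illustrative sales table: a multidimensional roll-up is translated into a standard relational GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, quarter TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Laptop", "North", "Q1", 1000.0), ("Laptop", "South", "Q1", 700.0),
    ("Phone",  "North", "Q2",  500.0), ("Phone",  "South", "Q2", 650.0),
])

# A multidimensional roll-up "total amount by product" expressed as a
# standard relational operation, the way a ROLAP server would translate it.
for product, total in conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product"):
    print(product, total)
```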
The overall Data Warehouse Architecture is shown in fig:
The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:
1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies, data mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.
3. System performance data, which includes indices, used to improve data access and retrieval
performance.
4. Information about the mapping from operational databases, which provides source RDBMSs and
their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business metadata, which include business terms and definitions, ownership information, etc.
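Purely as an illustration (not an actual repository format), such metadata entries might be represented in Python as follows; every field name here is hypothetical:

```python
# A toy, illustrative metadata repository entry for one warehouse table.
metadata_repository = {
    "sales_fact": {
        "schema": {"columns": ["date_key", "product_key", "region_key", "amount"]},
        "operational": {"currency": "active", "last_load": "2024-06-01", "errors": 0},
        "performance": {"indexes": ["date_key", "product_key"]},
        "source_mapping": {"system": "orders_db", "rule": "amount = qty * unit_price"},
        "business": {"owner": "Sales Ops", "definition": "One row per order line"},
    }
}

# A simple lookup, e.g., to direct a query to the most appropriate index or source.
print(metadata_repository["sales_fact"]["performance"]["indexes"])
```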
Principles of Data Warehousing
Load Performance
Data warehouses require the incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data the business requires.
Load Processing
Many phases must be taken to load new or update data into the data warehouse, including data
conversion, filtering, reformatting, indexing, and metadata update.
Data Quality Management
Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large, complex queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes and even terabytes.
Snowflake is a cloud-based data warehouse solution provided as SaaS (Software-as-a-Service) with full ANSI SQL support. It also has a unique architecture that allows users to simply create tables and start querying data with very little administration or DBA work required.
Snowflake Architecture
Snowflake's architecture is a hybrid of traditional shared-disk and shared-nothing architectures, combining the best of both. Let's go through these designs and see how Snowflake integrates them into a single architecture.
In a traditional shared-disk architecture, a single storage layer is accessible to all cluster nodes. Many cluster nodes, each with its own CPU and memory but no disk storage of its own, connect to this central storage layer.
In contrast to the shared-disk architecture, a shared-nothing architecture distributes the disk storage across the cluster nodes, each with its own CPU and memory. The advantage here is that data can be partitioned and stored across all cluster nodes, as each cluster node has its own disk storage.
Snowflake supports a high-level architecture as shown in the diagram below. Snowflake has three different layers:
1. Storage Layer
2. Compute Layer
3. Cloud Services Layer
1. Storage Layer
Snowflake organizes data into many smaller partitions that are internally optimized and compressed. It uses a columnar format for storage. Data is stored in cloud storage and acts like a shared-disk model, thus providing simple data management. This ensures that users do not have to worry about data distribution across the cluster nodes.
Compute nodes connect to the storage layer to fetch data for query processing. Since the storage layer is independent, we pay only for the monthly storage actually used. As Snowflake is offered in the cloud, storage is elastic and billed separately from compute.
2. Compute Layer
Snowflake uses "Virtual Warehouses" (described below) to run queries. Snowflake separates the query processing layer from the disk storage; queries execute in this layer using data from the storage layer.
Virtual Warehouses are MPP compute clusters consisting of multiple nodes with CPU and memory provisioned in the cloud by Snowflake. Multiple virtual warehouses can be created in Snowflake for a variety of needs depending on the workload. Each virtual warehouse can work against the single storage layer. Typically, a virtual warehouse has its own independent compute cluster and does not interact with other virtual warehouses.
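As a hedged sketch, assuming the snowflake-connector-python package and placeholder credentials, creating a virtual warehouse and running a query against it might look like the following; the warehouse name is hypothetical:

```python
import snowflake.connector  # assumes the snowflake-connector-python package is installed

# Placeholder credentials; real values come from your Snowflake account.
conn = snowflake.connector.connect(user="USER", password="PASSWORD", account="ACCOUNT_ID")
cur = conn.cursor()

# Provision an extra-small virtual warehouse (an independent compute cluster).
cur.execute("CREATE WAREHOUSE IF NOT EXISTS demo_wh WITH WAREHOUSE_SIZE = 'XSMALL'")
cur.execute("USE WAREHOUSE demo_wh")

# The query runs on the virtual warehouse but reads from the shared storage layer.
cur.execute("SELECT CURRENT_VERSION()")
print(cur.fetchone())

cur.close()
conn.close()
```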
Cloud data warehouses also offer automatic, near-real-time scalability and greater
system reliability and uptime than on-premises hardware, and transparent billing,
which allows enterprises to pay only for what they use.
Because cloud data warehouses don't rely on the rigid structures and data modeling
concepts inherent in traditional systems, they have diverse architectures.
• The Google BigQuery approach is more like software-as-a-service (SaaS) that allows
interactive analysis of big data. It can be used alongside Google Cloud Storage and
technologies such as MapReduce. BigQuery differentiates itself with a serverless
architecture, which means users cannot see details of resource allocation, as computational
and storage provisioning happens continuously and dynamically.
• Microsoft Azure SQL Data Warehouse is an elastic, large-scale data warehouse PaaS that
leverages the broad ecosystem of SQL Server. Like other cloud storage and computing
platforms, it uses a distributed, MPP architecture and columnar data store. It gathers data
from databases and SaaS platforms into one powerful, fully-managed centralized repository.
• A snowflake schema arranges tables and their connections so that a representative entity
relationship diagram (ERD) resembles a snowflake. A centralized fact table connects to
many dimension tables, which themselves connect to more dimension tables, and so on.
Data is normalized.
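A minimal sketch of such a normalized (snowflaked) dimension, using Python with SQLite; all table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Snowflake schema sketch: the product dimension is normalized into a further
# category dimension, and the central fact table references the dimensions.
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name TEXT,
                           category_id INTEGER REFERENCES dim_category(category_id));
CREATE TABLE fact_sales   (product_id  INTEGER REFERENCES dim_product(product_id),
                           amount REAL);
INSERT INTO dim_category VALUES (1, 'Electronics');
INSERT INTO dim_product  VALUES (10, 'Laptop', 1);
INSERT INTO fact_sales   VALUES (10, 999.0);
""")

# Analytical queries join outward from the fact table through the snowflaked dimensions.
query = """
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product  p ON f.product_id = p.product_id
JOIN dim_category c ON p.category_id = c.category_id
GROUP BY c.category_name
"""
print(conn.execute(query).fetchall())
```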