Data Warehouse Unit1 CS3551

UNIT I

INTRODUCTION TO DATA WAREHOUSE

Data warehouse Introduction - Data warehouse components- operational database Vs


data warehouse – Data warehouse Architecture – Three-tier Data Warehouse
Architecture - Autonomous Data Warehouse- Autonomous Data Warehouse Vs
Snowflake - Modern Data Warehouse

What is a Data Warehouse?


A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data
from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a collection of data that serves the entire organization, not only a
particular group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of
information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."
Characteristics of Data Warehouse

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around a
particular subject, such as customer, product, or sales, rather than the organization's
ongoing operations. This is done by excluding data that are not useful concerning the subject
and including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, and so on among the
different data sources.
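As a small illustration of this integration step, the following Python sketch standardizes records from two hypothetical sources so that they share one naming convention and one set of coded values. All field names, mappings, and codes here are invented for illustration:

```python
# Toy integration step: harmonize attribute names and coded values from two
# hypothetical source systems before loading into the warehouse.

def standardize(record, mapping, gender_codes):
    """Rename fields per `mapping` and normalize coded gender values."""
    out = {mapping.get(k, k): v for k, v in record.items()}
    if "gender" in out:
        out["gender"] = gender_codes.get(str(out["gender"]).upper(), "U")
    return out

# Source A uses cust_name/sex with "M"/"F"; source B uses name/gender with "1"/"2".
mapping_a = {"cust_name": "name", "sex": "gender"}
codes = {"M": "M", "F": "F", "1": "M", "2": "F"}

rec_a = standardize({"cust_name": "Asha", "sex": "F"}, mapping_a, codes)
rec_b = standardize({"name": "Ravi", "gender": "1"}, {}, codes)

print(rec_a)  # both records now use the same field names and codes
print(rec_b)
```

After this step, records from both sources can be loaded into one warehouse table without naming conflicts.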

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even further back from a data warehouse. This contrasts
with a transaction system, where often only the most current data is kept.

Non-Volatile
The data warehouse is a physically separate data store, populated with data transformed from the
source operational RDBMS. Operational updates of data do not occur in the data
warehouse, i.e., update and delete operations are not performed. It usually requires
only two procedures for data access: the initial loading of data and read access to the data.
Therefore, the DW does not require transaction processing, recovery, and concurrency-control
capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that,
once entered into the warehouse, data should not change.

History of Data Warehouse


The idea of data warehousing dates to the late 1980s, when IBM researchers Barry Devlin
and Paul Murphy introduced the "Business Data Warehouse."
In essence, the data warehousing idea was intended to support an architectural model for the
flow of information from operational systems to decision support environments. The
concept attempted to address the various problems associated with this flow, mainly the high
costs associated with it.
In the absence of a data warehousing architecture, a vast amount of redundant storage was required to
support multiple decision support environments. In large corporations, it was common for
various decision support environments to operate independently.
Goals of Data Warehousing
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
Need for Data Warehouse
Data Warehouse is needed for the following reasons:

1. Business User: Business users require a data warehouse to view summarized data
from the past. Since these people are non-technical, the data may be presented to them
in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data
from the past. This data can then be used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the
data warehouse, so the data warehouse contributes to making strategic decisions.
4. Data consistency and quality: By bringing data from different sources to a
common place, the user can effectively bring uniformity and consistency to the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected
loads and types of queries, which demands a significant degree of flexibility and a
quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate,
understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information
from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.
Components or Building Blocks of Data Warehouse
Architecture is the proper arrangement of the elements. We build a data warehouse from
software and hardware components. To suit the requirements of our organization, we
arrange these building blocks in a certain way, and we may want to strengthen one part with
extra tools and services. All of this depends on our circumstances.

The figure shows the essential elements of a typical warehouse. The Source Data
component is shown on the left. The Data Staging element serves as the next building block. In
the middle, we see the Data Storage component that manages the data warehouse's data. This
element not only stores and manages the data; it also keeps track of the data using the metadata
repository. The Information Delivery component, shown on the right, consists of all the
different ways of making the information in the data warehouse available to the users.
Source Data Component
Source data coming into the data warehouses may be grouped into four broad categories:
Production Data: This type of data comes from the different operational systems of the
enterprise. Based on the data requirements of the data warehouse, we choose segments of the
data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports,
customer profiles, and sometimes even departmental databases. This is the internal data, part of
which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In
every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry
produced by external agencies.
Data Staging Component
After we have extracted data from various operational systems and external sources, we
have to prepare the data for storage in the data warehouse. The extracted data coming from
several different sources needs to be changed, converted, and made ready in a format that is
suitable for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even greater ones. We perform several individual tasks as
part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
Standardization of data elements forms a large part of data transformation. Data
transformation also involves many forms of combining pieces of data from different sources: we
combine data from a single source record or related data parts from many source records.
On the other hand, data transformation also includes purging source data that is not useful
and separating out source records into new combinations. Sorting and merging of data take
place on a large scale in the data staging area. When the data transformation function ends,
we have a collection of integrated data that is cleaned, standardized, and summarized.
3) Data Loading: Two distinct categories of tasks form the data loading function. When we
complete the structure and construction of the data warehouse and go live for the first time,
we do the initial loading of the data into the data warehouse storage; the initial load
moves high volumes of data, using up a substantial amount of time. After that, periodic
incremental loads apply ongoing changes to the warehouse.
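The three staging functions above can be sketched in a few lines of Python. This is a toy example, not a real ETL tool: the source rows, the table name, and the cleaning rules are all invented for illustration, and an in-memory SQLite database stands in for the warehouse store:

```python
import sqlite3

# Minimal staging-area sketch: extract rows from two "source systems",
# clean and standardize them, then do the initial load into a warehouse table.

sources = [
    [{"id": 1, "city": "chennai ", "amount": "250"}],           # source system A
    [{"id": 2, "city": "Chennai", "amount": None},              # source system B
     {"id": 1, "city": "chennai ", "amount": "250"}],           # duplicate of A's row
]

def transform(row):
    # Cleaning: trim/case-fix text, supply a default for missing amounts.
    return (row["id"], row["city"].strip().title(),
            float(row["amount"]) if row["amount"] is not None else 0.0)

seen, staged = set(), []
for source in sources:                       # extraction, source by source
    for row in source:
        rec = transform(row)                 # transformation
        if rec[0] not in seen:               # eliminate duplicates across sources
            seen.add(rec[0])
            staged.append(rec)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)   # initial load
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)
```

Real ETL tools do the same extract, transform, and load steps, but against many schemata and much larger volumes.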
Data Storage Components
Data storage for the data warehouse is a separate repository. The data repositories for
operational systems generally contain only the current data, structured in a highly
normalized form for fast and efficient transaction processing.
Information Delivery Component
The information delivery element is used to enable the process of subscribing for data
warehouse files and having it transferred to one or more destinations according to some
customer-specified scheduling algorithm.

Metadata Component
Metadata in a data warehouse is equal to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures,
the data about the records and addresses, the information about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users,
with the scope confined to particular selected subjects. Data in a data warehouse should be fairly
current, but not necessarily up to the minute, although developments in the data warehouse
industry have made frequent, incremental data loads more achievable. Data marts are
smaller than data warehouses and usually contain data for a single department or subject area.
The current trend in data warehousing is to develop a data warehouse together with several
smaller, related data marts for particular kinds of queries and reports.
Management and Control Component
The management and control elements coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the
data warehouse storage. They also moderate the data delivery to the clients, work with the
database management systems to ensure that data is correctly saved in the repositories, and
monitor the movement of data into the staging area and from there into the data warehouse
storage itself.
Why we need a separate Data Warehouse?
Data Warehouse queries are complex because they involve the computation of large groups
of data at summarized levels.
They may require the use of distinctive data organization, access, and implementation methods
based on multidimensional views.
Performing OLAP queries in an operational database degrades the performance of operational
tasks.
A data warehouse is used for analysis and decision making, which requires an extensive database,
including historical data, which an operational database does not typically maintain.
The separation of an operational database from data warehouses is based on the different
structures and uses of data in these systems.
Because the two systems provide different functionalities and require different kinds of data,
it is necessary to maintain separate databases.
Difference between Database and Data Warehouse

1. A database is used for Online Transaction Processing (OLTP), but can be used for other
purposes such as data warehousing; it records current data from the clients. A data warehouse
is used for Online Analytical Processing (OLAP); it reads historical information to support
business decisions.
2. In a database, the tables and joins are complicated, since they are normalized for the
RDBMS; this reduces redundant data and saves storage space. In a data warehouse, the tables
and joins are simple, since they are de-normalized; this minimizes the response time for
analytical queries.
3. Data in a database is dynamic; data in a data warehouse is largely static.
4. Entity-relationship modeling techniques are used for database design; dimensional
modeling techniques are used for data warehouse design.
5. A database is optimized for write operations; a data warehouse is optimized for read
operations.
6. A database gives low performance for analytical queries; a data warehouse gives high
performance for analytical queries.
7. A database is the place where data is stored and managed for fast and efficient access;
a data warehouse is the place where application data is handled for analysis and reporting
purposes.
Difference between Operational Database and Data Warehouse

The operational database is the source of data for the data warehouse. It contains the
detailed data used to run the day-to-day operations of the business. The data
changes frequently as updates are made and reflects the current values of the last transactions.
Operational database management systems, also called OLTP (Online Transaction
Processing) systems, are used to manage dynamic data in real time.
Data warehouse systems serve users or knowledge workers for the purpose of data analysis
and decision-making. Such systems can organize and present information in specific formats
to accommodate the diverse needs of various users. These systems are called Online
Analytical Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the goals of
both these databases are different.
Operational Database Data Warehouse

Operational systems are designed to support high-volume transaction processing. Data
warehousing systems are typically designed to support high-volume analytical processing
(i.e., OLAP).
Operational systems are usually concerned with current data. Data warehousing systems are
usually concerned with historical data.
Data within operational systems are mainly updated regularly according to need. Data
warehouse data are non-volatile; new data may be added regularly, but once added it is
rarely changed.
Operational systems are designed for real-time business transactions and processes. Data
warehousing systems are designed for analysis of business measures by subject area,
categories, and attributes.
Operational systems are optimized for a simple set of transactions, generally adding or
retrieving a single row at a time per table. Data warehousing systems are optimized for bulk
loads and large, complex, unpredictable queries that access many rows per table.
Operational systems are optimized for validation of incoming data during transactions and
use validation data tables. Data warehouses are loaded with consistent, valid data and
require no real-time validation.
Operational systems support thousands of concurrent clients. Data warehousing systems
support a few concurrent clients relative to OLTP.
Operational systems are widely process-oriented. Data warehousing systems are widely
subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively
small volumes of data. Data warehousing systems are usually optimized to perform fast
retrievals of relatively large volumes of data.
Operational systems are about data in. Data warehousing systems are about data out.
Operational systems access a small number of records per query. Data warehousing systems
access a large number of records per query.
Relational databases are created for Online Transaction Processing (OLTP). Data
warehouses are designed for Online Analytical Processing (OLAP).

Autonomous Data Warehouse


Oracle Cloud provides a set of data management services built on self-driving Oracle
Autonomous Database technology to deliver automated patching, upgrades, and tuning,
including performing all routine database maintenance tasks while the system is running,
without human intervention.
Oracle recently announced an update to its Autonomous Data Warehouse (ADW) service.
The update positions the company to gain market share against its cloud rivals in the
competitive cloud data warehouse (CDW) space.

Autonomous Data Warehouse (ADW) is a cloud-based data warehousing service


offered by Oracle Corporation. It is designed to provide a fully managed, self-driving,
and self-securing data warehouse solution for businesses.

Key features and benefits of Autonomous Data Warehouse include:

1. Automated Management: ADW is self-driving, which means it uses artificial


intelligence and machine learning algorithms to automatically handle tasks like
provisioning, scaling, tuning, backup, and security. This reduces the need for manual
intervention and allows data professionals to focus on deriving insights from the data
rather than managing the infrastructure.
2. Performance and Scalability: ADW is designed to deliver high performance for
analytical workloads. It uses a combination of in-memory processing, columnar
storage, and parallel processing to achieve fast query performance. Additionally, it
can automatically scale resources up or down based on workload demands, ensuring
optimal performance without manual intervention.
3. Security: ADW incorporates robust security features, including automatic encryption
of data at rest and in transit. It also employs AI-based threat detection and mitigation
to protect against unauthorized access and potential breaches.
4. Flexibility: ADW supports a variety of data types and data sources, enabling
businesses to consolidate and analyze data from different sources. It supports
structured and semi-structured data, making it suitable for a wide range of analytical
use cases.
5. Ease of Use: The service is designed to be user-friendly, even for those who may not
have extensive experience in data warehousing. It offers tools for data loading, data
modeling, and querying that are intuitive and easy to use.
6. Pay-as-You-Go Pricing: ADW follows a subscription-based pricing model, allowing
organizations to pay for the resources they consume on a monthly basis. This can
help in cost optimization and scaling resources as needed.
7. Integration with Oracle Ecosystem: ADW can seamlessly integrate with other
Oracle Cloud services, as well as on-premises Oracle databases. This allows
businesses to create hybrid solutions that span both cloud and on-premises
environments.
8. Analytics and Reporting: ADW is well-suited for complex analytics and reporting
tasks, making it a valuable tool for business intelligence and data analytics teams.

Difference between OLTP and OLAP

OLTP System
An OLTP system handles operational data: the data involved in the operation of a particular
system, for example, ATM and bank transactions.
OLAP System
An OLAP system handles historical or archival data: data accumulated over a long period.
For example, if we collect the last 10 years' information about flight reservations, the data
can give us meaningful insights such as trends in reservations. This may provide useful
information like the peak time of travel and what kinds of people are traveling in various
classes (Economy/Business).
The major difference between an OLTP and an OLAP system is the amount of data analyzed
in a single transaction: an OLTP system manages many concurrent users and queries touching
only an individual record or limited groups of records at a time, whereas an OLAP system
must be able to operate on millions of records to answer a single query.

Feature: OLTP vs. OLAP

Characteristic. OLTP: a system used to manage operational data. OLAP: a system used to
manage informational data.
Users. OLTP: clerks, clients, and information technology professionals. OLAP: knowledge
workers, including managers, executives, and analysts.
System orientation. OLTP: customer-oriented; transaction and query processing are done by
clerks, clients, and information technology professionals. OLAP: market-oriented; data
analysis is done by knowledge workers, including managers, executives, and analysts.
Data contents. OLTP: manages current data that, typically, are too detailed to be easily used
for decision making. OLAP: manages large amounts of historical data, provides facilities for
summarization and aggregation, and stores and manages data at different levels of
granularity; this makes the data easier to use in informed decision making.
Database size. OLTP: 100 MB to GB. OLAP: 100 GB to TB.
Database design. OLTP: usually uses an entity-relationship (ER) data model and an
application-oriented database design. OLAP: typically uses either a star or a snowflake
model and a subject-oriented database design.
View. OLTP: focuses primarily on the current data within an enterprise or department,
without referring to historical data or data in different organizations. OLAP: often spans
multiple versions of a database schema, due to the evolutionary process of an organization,
and also deals with data that originates from various organizations, integrating information
from many data stores.
Volume of data. OLTP: not very large. OLAP: because of their large volume, OLAP data are
stored on multiple storage media.
Access patterns. OLTP: access consists mainly of short, atomic transactions; such a system
requires concurrency control and recovery techniques. OLAP: accesses are mostly read-only
operations, because these data warehouses store historical data.
Access mode. OLTP: read/write. OLAP: mostly read.
Inserts and updates. OLTP: short and fast inserts and updates initiated by end users. OLAP:
periodic long-running batch jobs refresh the data.
Number of records accessed. OLTP: tens. OLAP: millions.
Normalization. OLTP: fully normalized. OLAP: partially normalized (de-normalized).
Processing speed. OLTP: very fast. OLAP: depends on the amount of data involved; batch
data refreshes and complex queries may take many hours, and query speed can be improved
by creating indexes.
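The contrast in access patterns can be illustrated with a small Python/SQLite sketch. The reservations table and its columns are invented for this example: the OLTP-style statement touches a single row, while the OLAP-style query scans the whole history to answer one analytical question:

```python
import sqlite3

# Toy flight-reservation history: 1000 rows spread over ten years.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations (id INTEGER PRIMARY KEY, year INTEGER, class TEXT)")
rows = [(i, 2014 + i % 10, "Economy" if i % 4 else "Business") for i in range(1000)]
conn.executemany("INSERT INTO reservations VALUES (?, ?, ?)", rows)

# OLTP-style: a short, atomic transaction on a single record.
conn.execute("UPDATE reservations SET class = 'Business' WHERE id = 42")

# OLAP-style: a read-only aggregate over the full history (a trend query).
trend = conn.execute(
    "SELECT year, class, COUNT(*) FROM reservations "
    "GROUP BY year, class ORDER BY year, class").fetchall()
print(trend[:2])
```

The update reads and writes one row; the trend query reads every row but writes nothing, which is why OLAP systems can be optimized purely for retrieval.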

Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data
communication, processing, and presentation that exists for end-client computing within the
enterprise. Each data warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory
control are designed for online transaction processing (OLTP). Such applications gather detailed
data from day-to-day operations.
Data warehouse applications are designed to support the users' ad-hoc data requirements, an
activity recently dubbed online analytical processing (OLAP). These include applications such as
forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In
contrast, a warehouse database is updated from the operational systems periodically, usually
during off-hours. As OLTP data accumulates in production databases, it is regularly extracted,
filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the
warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of
errors and redundancies, and new fields and keys are added to reflect the users' needs for
sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's
situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic

Operational System
The term operational system is used in data warehousing to refer to a system that processes the
day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified, and file
size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
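As a toy illustration of how metadata can direct a query, the sketch below keeps a tiny metadata catalog and routes a query to the most appropriate table. All table names, fields, and grains here are invented for illustration:

```python
# Hypothetical metadata repository: records where each warehouse table came
# from, when it was refreshed, and at what grain it stores data.

catalog = {
    "sales_fact": {"source": "orders_oltp", "refreshed": "2024-01-02",
                   "grain": "one row per order line"},
    "sales_daily": {"source": "sales_fact", "refreshed": "2024-01-02",
                    "grain": "one row per day per region"},
}

def route(query_grain):
    """Pick the first catalog table whose recorded grain matches the query."""
    for table, meta in catalog.items():
        if query_grain in meta["grain"]:
            return table
    return "sales_fact"  # fall back to the detail table

print(route("per day"))
```

A real metadata repository is far richer (sources, access procedures, users, schemas), but the routing idea is the same: the query planner consults metadata rather than the data itself.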
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of summarized information is to speed up query performance. The summarized data is
updated continuously as new data is loaded into the warehouse.
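The idea of lightly and highly summarized data can be sketched with SQLite (the table and column names are assumed for illustration): detail rows are pre-aggregated once, so later queries read the small summary tables instead of scanning the detail:

```python
import sqlite3

# Detail data: one row per individual sale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (day TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", [
    ("2024-01-01", "South", 100.0), ("2024-01-01", "South", 50.0),
    ("2024-01-02", "North", 75.0),
])

# Lightly summarized: aggregated by day and region.
conn.execute("""CREATE TABLE sales_daily AS
                SELECT day, region, SUM(amount) AS amount
                FROM sales_detail GROUP BY day, region""")

# Highly summarized: aggregated by region only.
conn.execute("""CREATE TABLE sales_by_region AS
                SELECT region, SUM(amount) AS amount
                FROM sales_daily GROUP BY region""")

# A common query now reads the tiny summary table, not the detail.
south = conn.execute(
    "SELECT amount FROM sales_by_region WHERE region = 'South'").fetchone()[0]
print(south)
```

When new detail rows are loaded, the warehouse manager refreshes these summary tables so queries keep seeing current aggregates.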
End-User access Tools
The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These users interact with the warehouse using end-client access tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process operational data before putting it into the warehouse. We can do
this programmatically, although most data warehouses use a staging area (a place where data is
processed before entering the warehouse) instead.
A staging area simplifies data cleansing and consolidation for operational data coming from
multiple source systems, especially for enterprise data warehouses, where all relevant data of an
enterprise is consolidated.

Data Warehouse Staging Area is a temporary location where a record from source systems is copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization. We can do this by adding data marts. A data mart is a segment of a data warehouse
that provides information for reporting and analysis on a section, unit, department, or operation
in the company, e.g., sales, payroll, or production.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a
financial analyst wants to analyze historical data for purchases and sales or mine historical information
to make predictions about customer behavior.

Properties of Data Warehouse Architectures


The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be simple to upgrade as the data
volume, which has to be managed and processed, and the number of users' requirements, which
have to be met, progressively increase.
3. Extensibility: The architecture should be able to host new applications and technologies
without redesigning the whole system.
4. Security: Monitoring accesses is necessary because of the strategic data stored in the data
warehouse.
5. Administerability: Data Warehouse management should not be complicated.
Types of Data Warehouse Architectures

Single-Tier Architecture
Single-Tier architecture is not periodically used in practice. Its purpose is to minimize the amount of
data stored to reach this goal; it removes data redundancies.
The figure shows the only layer physically available is the source layer. In this method, data warehouses
are virtual. This means that the data warehouse is implemented as a multidimensional view of
operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between
analytical and transactional processing. Analysis queries are agreed to operational data after the
middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data
warehouse system, as shown in fig:

Although it is typically called a two-layer architecture to highlight the separation between
physically available sources and the data warehouse, it in fact consists of four subsequent data
flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is
stored initially in corporate relational databases or legacy databases, or it may come from
information systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can
combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load
source data into a data warehouse.
3. Data Warehouse layer: Information is saved to one logically centralized individual repository:
a data warehouse. The data warehouses can be directly accessed, but it can also be used as a
source for creating data marts, which partially replicate data warehouse contents and are
designed for specific enterprise departments. Meta-data repositories store information on
sources, access procedures, data staging, users, data mart schema, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue
reports, dynamically analyze information, and simulate hypothetical business scenarios. It
should feature aggregate-information navigators, complex query optimizers, and
customer-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the
reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The
reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the
whole enterprise. At the same time, it separates the problems of source data extraction and integration
from those of data warehouse population. In some cases, the reconciled layer is also used directly to
better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily
prepared using the corporate applications, or generating data flows to feed external processes
periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this
structure is the extra storage space consumed by the redundant reconciled layer. It also moves
the analytical tools a little further away from being real-time.

Three-Tier Data Warehouse Architecture


Data Warehouses usually have a three-level (tier) architecture that includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).
A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It may
include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by external
consultants) are extracted using application program interfaces called a gateway. A gateway is provided
by the underlying DBMS and allows customer programs to generate SQL code to be executed at a
server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking and
Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
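The gateway idea can be illustrated with a short sketch: a client program builds SQL and submits it through a standard API for execution at the server. Here Python's built-in DB-API (with sqlite3 standing in for an ODBC/JDBC connection to a remote server) plays the gateway's role; the `sales` table and its data are illustrative.

```python
# Gateway sketch: the client generates SQL; the standard API ships it
# to the database engine for execution and returns the result set.
import sqlite3

conn = sqlite3.connect(":memory:")   # with ODBC/JDBC this would name a remote server
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EAST", 120.0), ("WEST", 80.0), ("EAST", 50.0)])

# The client program composes SQL; the gateway executes it at the server.
cursor = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
print(cursor.fetchall())   # → [('EAST', 170.0), ('WEST', 80.0)]
```

The same pattern holds whether the driver behind the API is ODBC, OLE-DB, or JDBC: the client sees cursors and result sets, not the server's internals.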

A middle-tier which consists of an OLAP server for fast querying of the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
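The ROLAP mapping can be made concrete with a small sketch: a multidimensional "roll-up" along a location hierarchy (city → country) becomes an ordinary relational GROUP BY. The `fact_sales` table, its columns, and the data are illustrative, with sqlite3 standing in for the relational engine.

```python
# ROLAP sketch: roll-up along the city → country hierarchy is expressed
# as GROUP BY at successively coarser grouping levels.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (city TEXT, country TEXT, units INTEGER)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("Chennai", "India", 10), ("Mumbai", "India", 20), ("Paris", "France", 15),
])

# Fine-grained level of the cube: group by city
by_city = conn.execute(
    "SELECT city, SUM(units) FROM fact_sales GROUP BY city ORDER BY city").fetchall()

# Roll-up to the country level: the same relational operation, coarser grouping
by_country = conn.execute(
    "SELECT country, SUM(units) FROM fact_sales GROUP BY country ORDER BY country").fetchall()

print(by_city)     # → [('Chennai', 10), ('Mumbai', 20), ('Paris', 15)]
print(by_country)  # → [('France', 15), ('India', 30)]
```

A MOLAP server, by contrast, would precompute and store such aggregates directly in a multidimensional array rather than deriving them from relational tables at query time.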
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as additional
tools for data mining of the OLAP-generated data.
The overall Data Warehouse Architecture is shown in fig:
The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:
1. A description of the DW structure, including the warehouse schema, dimension, hierarchies, data
mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.
3. System performance data, which includes indices, used to improve data access and retrieval
performance.
4. Information about the mapping from operational databases, which provides source RDBMSs and
their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business metadata, which includes
business terms and definitions, ownership information, etc.
Principles of Data Warehousing

Load Performance
Data warehouses require the incremental loading of new data on a periodic basis within narrow time
windows; performance of the load process should be measured in hundreds of millions of rows and
gigabytes per hour and must not artificially constrain the volume of data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data
conversion, filtering, reformatting, indexing, and metadata updates.
Data Quality Management
Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large,
complex queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to
hundreds of gigabytes and even terabytes.

Autonomous Data Warehouse


Oracle Autonomous Data Warehouse is the world's first and only autonomous database optimized for
analytic workloads, including data marts, data warehouses, data lakes, and data lakehouses. With
Autonomous Data Warehouse, data scientists, business analysts, and nonexperts can rapidly, easily, and
cost-effectively discover business insights using data of any size and type. Built for the cloud and
optimized using Oracle Exadata, Autonomous Data Warehouse benefits from faster performance.

What is the Snowflake Data Warehouse?

Snowflake is a cloud-based Data Warehouse solution provided as SaaS (Software-as-a-Service) with full
ANSI SQL support. It also has a unique architecture that allows users to simply create tables and start
querying data with very little administration or DBA work required.

Snowflake Architecture

Snowflake's architecture combines elements of the traditional shared-disk and shared-nothing
architectures to get the best of both. Let's go through these two architectures and see how Snowflake
integrates them into a new hybrid design.

Shared-Disk Architecture Overview

Used in traditional databases, shared-disk architecture has a single storage layer accessible to all cluster
nodes. Multiple cluster nodes with CPU and memory, but no disk storage of their own, connect to the
central storage layer for data processing.

Shared-Nothing Architecture Overview

In contrast to the shared-disk architecture, shared-nothing architecture gives each cluster node its own
disk storage, CPU, and memory. The advantage is that data can be partitioned and stored across
all cluster nodes, since each node has its own local disk.
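The shared-nothing idea can be sketched in plain Python. The node count, the hash function, and the toy data are all illustrative; the point is that each "node" owns only its own partition, and a distributed aggregate combines per-node partial results with no shared disk involved.

```python
# Shared-nothing sketch: data is partitioned by hashing the key, so each
# node stores and processes only its own slice of the data.
NUM_NODES = 3
nodes = [[] for _ in range(NUM_NODES)]   # each inner list models one node's local disk

def node_for(key):
    # A deterministic hash routes each key to exactly one node.
    return sum(key.encode()) % NUM_NODES

for key, amount in [("alice", 10), ("bob", 20), ("carol", 30), ("alice", 5)]:
    nodes[node_for(key)].append((key, amount))

# Distributed SUM: each node aggregates its own partition independently,
# then the partial results are combined at the coordinator.
partials = [sum(amount for _, amount in node) for node in nodes]
print(sum(partials))   # → 65
```

The same hash routing also guarantees that all rows for one key land on the same node, which is what lets per-key operations run without cross-node data movement.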

Snowflake Architecture – Hybrid Model

Snowflake combines the two in a hybrid architecture, as shown in the diagram below. Snowflake has 3 distinct
layers:

1. Storage Layer

2. Compute Layer

3. Cloud Services Layer

1. Storage Layer

Snowflake organizes data into many small micro-partitions that are internally optimized and compressed,
stored in a columnar format. Data resides in cloud storage and behaves like a shared-disk model, providing
ease of data management: users do not have to worry about distributing data across multiple nodes, as they
would in a pure shared-nothing model.

Compute nodes connect to the storage layer to fetch the data needed for query processing. Since the storage
layer is independent, customers pay only for the storage they actually use. As Snowflake runs in the
cloud, storage is elastic and charged per TB of monthly usage.

2. Compute Layer

Snowflake uses "virtual warehouses" (described below) to execute queries. Snowflake separates the
query processing layer from disk storage; queries run in this layer against data fetched from the storage
layer.

Virtual warehouses are MPP compute clusters consisting of multiple nodes with CPU and memory,
provisioned in the cloud by Snowflake. Multiple virtual warehouses can be created in Snowflake for
different workloads as needed. Each virtual warehouse works against the single shared storage layer;
generally, a virtual warehouse has its own independent compute cluster and does not interact with other
warehouses.

Autonomous Data Warehouse Vs Snowflake


Autonomous Data Warehouse (ADW) and Snowflake are both popular cloud-based data warehousing
solutions, but they come from different vendors and have some differences in terms of features,
architecture, and pricing. Let's compare them in several key aspects:

1. Vendor and Ecosystem:


• ADW is offered by Oracle and is part of the Oracle Cloud ecosystem. It's designed to integrate
well with other Oracle products and services.
• Snowflake is an independent cloud data warehousing company. It's known for its cloud-native
architecture and its focus on providing a data platform that works across multiple cloud
providers.
2. Architecture:
• ADW uses Oracle's database technology and is designed to work with Oracle Database, utilizing
its features for data warehousing.
• Snowflake has a unique architecture that separates storage and compute, allowing for more
flexible scaling. It uses virtual warehouses (compute clusters) that can be scaled up or down
independently from storage.
3. Automated Management:
• Both ADW and Snowflake offer automated management features. ADW uses AI and machine
learning to automate various tasks, while Snowflake's architecture inherently enables automatic
scaling and resource management.
4. Performance:
• ADW leverages Oracle's technology for performance optimizations, including in-memory
processing and parallel query execution.
• Snowflake's architecture is built for elasticity and scalability, which can help maintain
performance during peak workloads.
5. Ease of Use:
• Both platforms are designed with ease of use in mind. ADW provides tools for data loading,
modeling, and querying. Snowflake emphasizes simplicity and ease of use through its web-based
interface.
6. Pricing:
• ADW follows an Oracle-style pricing model, which can include various components like
storage, compute, and features. Pricing can be complex and may involve licensing
considerations.
• Snowflake offers a more transparent and flexible pricing model, generally charging for storage
and compute separately. This can make it easier to estimate costs.
7. Multi-Cloud Support:
• Snowflake was built with multi-cloud support in mind, allowing users to run the same Snowflake
instance across different cloud providers (like AWS, Azure, and Google Cloud).
• ADW is primarily offered within the Oracle Cloud ecosystem.
8. Data Sharing:
• Both ADW and Snowflake provide data sharing capabilities, allowing organizations to securely
share data with external parties.
9. Security:
• Both solutions offer security features such as encryption, user access controls, and compliance
certifications.

The modern cloud data warehouse


Cloud data warehouses are more adaptable, performant, and powerful than in-house
systems. Businesses can save on staffing and can put their IT staff to better use,
because their infrastructure is managed by dedicated specialists.

Cloud data warehouses feature column-oriented databases, where the unit of


storage is a single attribute, with values from all records. Columnar storage does not
change how customers organize or represent data, but allows for faster access and
processing.
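The row-versus-column layout can be shown in miniature. The records and field names here are made up; the point is that the same data is organized either as one tuple per record or as one sequence per attribute, and a single-attribute aggregate only has to touch that one column in the columnar layout.

```python
# Row store vs column store, in miniature. Identical records, two layouts.
rows = [                      # row-oriented: one tuple per record
    ("alice", "UK", 120.0),
    ("bob",   "US",  80.0),
    ("carol", "UK",  50.0),
]

# Column-oriented: the unit of storage is a single attribute,
# holding values from all records.
columns = {
    "name":    [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount":  [r[2] for r in rows],
}

# An aggregate over one attribute scans only the "amount" column,
# never touching the other fields of each record.
print(sum(columns["amount"]))   # → 250.0
```

This is why analytic queries, which typically aggregate a few attributes over many records, run faster on columnar storage, while the logical tables customers see are unchanged.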

Cloud data warehouses also offer automatic, near-real-time scalability and greater
system reliability and uptime than on-premises hardware, and transparent billing,
which allows enterprises to pay only for what they use.

Because cloud data warehouses don't rely on the rigid structures and data modeling
concepts inherent in traditional systems, they have diverse architectures.

• Amazon Redshift's approach is akin to infrastructure-as-a-service (IaaS) or platform-as-a-


service (PaaS). Redshift is highly scalable, provisioning clusters of nodes to customers as
their storage and computing needs evolve. Each node has individual CPU, RAM, and
storage space, facilitating the massive parallel processing (MPP) needed for any big data
application, especially the data warehouse. Customers are responsible for some
capacity planning and must provision compute and storage nodes on the platform.

• The Google BigQuery approach is more like software-as-a-service (SaaS) that allows
interactive analysis of big data. It can be used alongside Google Cloud Storage and
technologies such as MapReduce. BigQuery differentiates itself with a serverless
architecture, which means users cannot see details of resource allocation, as computational
and storage provisioning happens continuously and dynamically.

• Snowflake's automatically managed storage layer can contain structured or semistructured


data, such as nested JSON objects. The compute layer is composed of clusters, each of
which can access all data but work independently and concurrently to enable automatic
scaling, distribution, and rebalancing. Snowflake is a data warehouse-as-a-service, and
operates across multiple clouds, including AWS, Microsoft Azure and, soon, Google
Cloud.

• Microsoft Azure SQL Data Warehouse is an elastic, large-scale data warehouse PaaS that
leverages the broad ecosystem of SQL Server. Like other cloud storage and computing
platforms, it uses a distributed, MPP architecture and columnar data store. It gathers data
from databases and SaaS platforms into one powerful, fully-managed centralized repository.

Data warehousing schemas


Data warehouses are relational databases, and they are associated with traditional
schemas, which are the ways in which records are described and organized.

• A snowflake schema arranges tables and their connections so that a representative entity
relationship diagram (ERD) resembles a snowflake. A centralized fact table connects to
many dimension tables, which themselves connect to more dimension tables, and so on.
Data is normalized.

Snowflake schema shown in fig:


• The simpler star schema is a special case of the snowflake schema. Only one level of
dimension tables is connected to the central fact table, resulting in ERDs with star shapes.
These dimension tables are denormalized, containing all attributes and information
associated with the particular type of record they hold.
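A star schema can be sketched directly in SQL: one central fact table joined to denormalized dimension tables. The table names, columns, and data below are illustrative, with sqlite3 standing in for the warehouse RDBMS.

```python
# Star schema sketch: a central fact table with foreign keys into
# denormalized dimension tables, queried by join-and-aggregate.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, units INTEGER,
                          FOREIGN KEY(product_id) REFERENCES dim_product(product_id),
                          FOREIGN KEY(date_id)    REFERENCES dim_date(date_id));
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date    VALUES (1, '01', 'Jan', 2024), (2, '02', 'Jan', 2024);
INSERT INTO fact_sales  VALUES (1, 1, 3), (2, 1, 1), (1, 2, 2);
""")

# A typical star-schema query: join the fact table to a dimension
# and aggregate along one of its attributes.
result = conn.execute("""
    SELECT p.category, SUM(f.units)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(result)   # → [('Electronics', 5), ('Furniture', 1)]
```

In a snowflake schema, `dim_product` would itself be normalized, for example into a separate `dim_category` table referenced by a foreign key, adding one more join to the same query.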
