Unit 1
A data warehouse environment contains an extraction, transformation, and loading (ETL) solution,
an online analytical processing (OLAP) engine, client analysis tools, and other applications
that manage the process of gathering data and delivering it to business users.
A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data from
a single source or from multiple sources.
A Data Warehouse is a collection of data specific to the entire organization, not only to a
particular group of users.
It is not used for daily operations and transaction processing, but for decision making.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of a
particular subject, such as customer, product, or sales, rather than the organization's
ongoing operations. This is done by excluding data that are not useful for the subject and
including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files,
and online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among the different
data sources.
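As a rough illustration of this integration step, the sketch below (plain Python with the pandas library; the source extracts and column names such as cust_id and CustomerID are invented, not from the original text) shows how two sources with different naming conventions and attribute types might be reconciled into one standard schema:

```python
import pandas as pd

# Hypothetical extracts from two heterogeneous sources with different conventions.
crm_orders = pd.DataFrame({"cust_id": ["C01", "C02"], "order_amt": ["120.50", "75.00"]})
erp_orders = pd.DataFrame({"CustomerID": ["C03"], "OrderAmount": [210.0]})

# Reconcile naming conventions: map each source's columns onto one standard schema.
crm_orders = crm_orders.rename(columns={"cust_id": "customer_id", "order_amt": "order_amount"})
erp_orders = erp_orders.rename(columns={"CustomerID": "customer_id", "OrderAmount": "order_amount"})

# Reconcile attribute types: amounts become numeric in both extracts.
crm_orders["order_amount"] = crm_orders["order_amount"].astype(float)

# Integrated, consistent view ready for the warehouse.
integrated = pd.concat([crm_orders, erp_orders], ignore_index=True)
print(integrated)
```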
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older from a data warehouse. This
contrasts with a transaction system, where often only the most recent data is kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from the
source operational RDBMS. Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed. It usually requires only two
procedures in data accessing: initial loading of data and access to data. Therefore, the DW does
not require transaction processing, recovery, and concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse,
data should not change.
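To make the time-variant and non-volatile characteristics concrete, here is a minimal sketch in plain Python (the sales history, periods, and figures are invented for illustration): the warehouse is append-only, so earlier rows are never updated or deleted and any past snapshot can still be retrieved.

```python
# Append-only history: each load adds rows for a new period; nothing is updated or deleted.
sales_history = [
    {"period": "2023-01", "product": "widget", "units_sold": 100},
    {"period": "2023-02", "product": "widget", "units_sold": 120},
]

def load_period(rows, new_rows):
    """Initial/periodic load: the only write operation the warehouse needs."""
    return rows + new_rows

sales_history = load_period(
    sales_history, [{"period": "2023-03", "product": "widget", "units_sold": 90}]
)

# Read-intensive access: retrieve a historical view, e.g. everything up to February 2023.
as_of_february = [row for row in sales_history if row["period"] <= "2023-02"]
print(as_of_february)
```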
The idea of data warehousing dates to the late 1980s, when IBM researchers Barry Devlin
and Paul Murphy introduced the "business data warehouse."
In essence, the data warehousing idea was intended to provide an architectural model for
the flow of information from operational systems to decision support environments. The
concept attempted to address the various problems associated with this flow, mainly the high costs
associated with it.
In the absence of a data warehousing architecture, a vast amount of space was required to
support multiple decision support environments. In large corporations, it was common for various
decision support environments to operate independently.
1. Business users: Business users require a data warehouse to view summarized data from
the past. Since these people are non-technical, the data may be presented to them in an
elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from
the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data
warehouse. So, the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing data from different sources to a
common place, the user can effectively achieve uniformity and consistency
in the data.
5. High response time: The data warehouse has to be ready for somewhat unexpected loads
and types of queries, which demands a significant degree of flexibility and fast response
time.
The figure shows the essential elements of a typical warehouse. The Source Data
component is shown on the left. The Data Staging element serves as the next building block.
In the middle is the Data Storage component that holds the data warehouse's data.
This element not only stores and manages the data; it also keeps track of the data using the
metadata repository. The Information Delivery component, shown on the right, consists of all
the different ways of making the information from the data warehouse available to the
users.
Source Data Component
Source data coming into the data warehouses may be grouped into four broad
categories:
Production Data: This type of data comes from the various operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose segments of
the data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports,
customer profiles, and sometimes even departmental databases. This is the internal data, part
of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In
every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry
produced by external agencies.
Data Staging Component
After we have extracted data from the various operational systems and external
sources, we have to prepare the data for storing in the data warehouse. The extracted data
coming from several different sources needs to be changed, converted, and made ready in a
format that is suitable for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to
employ the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses big challenges, data
transformation presents even bigger challenges. We perform several individual tasks as
part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or
elimination of duplicates when we bring in the same data from various source systems.
On the other hand, data transformation also includes purging source data that is not useful and
separating out source records into new combinations. Sorting and merging of data take place
on a large scale in the data staging area. When the data transformation function ends, we have
a collection of integrated data that is cleaned, standardized, and summarized.
3) Data Loading: Two distinct categories of tasks form the data loading function. When we
complete the structure and construction of the data warehouse and go live for the first time, we
do the initial loading of the data into the data warehouse storage. The initial load moves
high volumes of data and uses up a substantial amount of time.
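The hedged sketch below ties the three staging functions together in Python (using pandas and the standard sqlite3 module as stand-ins for real extraction tools and warehouse storage; every file, table, and column name is invented for illustration):

```python
import sqlite3
import pandas as pd

# --- 1) Data extraction ------------------------------------------------------
# In practice each source needs its own technique (pd.read_csv for flat files, a
# database connector for RDBMS sources, and so on). Here two small extracts are
# built in memory so the sketch stays self-contained; all names are invented.
crm = pd.DataFrame({"cust_id": [7, 9, 9], "st": ["ny", "ca", "ca"]})
billing = pd.DataFrame({"CUST_NO": [7, 8], "STATE": ["NY", None]})

# --- 2) Data transformation --------------------------------------------------
# Standardize naming conventions, clean values, supply defaults, drop duplicates.
crm = crm.rename(columns={"cust_id": "customer_id", "st": "state"})
billing = billing.rename(columns={"CUST_NO": "customer_id", "STATE": "state"})

staged = pd.concat([crm, billing], ignore_index=True)
staged["state"] = staged["state"].str.upper().fillna("UNKNOWN")   # default for missing values
staged = staged.drop_duplicates(subset="customer_id")             # same customer from two sources

# --- 3) Data loading ---------------------------------------------------------
# The initial load writes the integrated, cleaned data into warehouse storage.
warehouse = sqlite3.connect("warehouse.db")
staged.to_sql("dim_customer", warehouse, if_exists="replace", index=False)
```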
Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally contain only the current data. Also, these repositories
hold data structured in a highly normalized form for fast and efficient processing.
The information delivery element enables the process of subscribing to data
warehouse information and having it delivered to one or more destinations according to some
user-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep data about the logical data structures,
the data about the records and addresses, the information about the indexes, and so on.
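As a toy illustration (not any particular product's metadata format; the table, sources, and rules are invented), the sketch below records the kind of facts a metadata repository might keep about one warehouse table, its sources, and its transformation rules:

```python
# A minimal, hypothetical metadata entry for one warehouse table.
metadata_repository = {
    "dim_customer": {
        "description": "One row per customer, integrated from CRM and billing",
        "source_systems": ["crm_customers.csv", "billing.customers"],
        "refresh_schedule": "daily at 02:00",
        "columns": {
            "customer_id": {"type": "INTEGER", "source_field": "cust_id / CUST_NO"},
            "state":       {"type": "TEXT", "rule": "upper-cased; UNKNOWN when missing"},
        },
    }
}

# Like a data dictionary, the repository lets users and tools look structures up by name.
print(metadata_repository["dim_customer"]["columns"]["state"])
```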
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to particular selected subjects. Data in a data warehouse should be fairly
current, but not necessarily up to the minute, although developments in the data warehouse industry
have made regular and incremental data loads more achievable. Data marts are smaller than
data warehouses and usually contain data for a single department or business area. The current trend in data warehousing is
to develop a data warehouse with several smaller related data marts for particular kinds of
queries and reports.
The management and control elements coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the
data warehouse storage. They also moderate data delivery to the clients. They
work with the database management systems and ensure that data is correctly stored in the
repositories. They monitor the movement of data into the staging area and from there
into the data warehouse storage itself.
Data Warehouse queries are complex because they involve the computation of large groups of
data at summarized levels.
It may require the use of distinctive data organization, access, and implementation methods
based on multidimensional views.
Performing OLAP queries against an operational database degrades the performance of operational tasks.
A data warehouse is used for analysis and decision making, which requires an extensive database,
including historical data, that an operational database does not typically maintain.
The separation of an operational database from data warehouses is based on the different
structures and uses of data in these systems.
Because the two systems provide different functionalities and require different kinds of data,
it is necessary to maintain separate databases.
Difference between Database and Data Warehouse
Database:
1. It is used for Online Transactional Processing (OLTP), but it can also be used for other objectives such as data warehousing. It records the current data from clients.
2. The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space.
3. Entity-Relationship modeling techniques are used for RDBMS database design.
4. Performance is low for analytical queries.
5. The database is the place where the data is taken as a base and managed to obtain fast and efficient access.

Data Warehouse:
1. It is used for Online Analytical Processing (OLAP). It reads the historical data for business decisions.
2. The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.
3. Data modeling techniques are used for data warehouse design.
4. High performance for analytical queries.
5. The data warehouse is the place where application data is managed for analysis and reporting objectives.
The Operational Database is the source of information for the data warehouse. It includes detailed
information used to run the day to day operations of the business. The data frequently changes as
updates are made and reflect the current value of the last transactions.
Operational Database Management Systems, also called OLTP (Online Transaction Processing)
databases, are used to manage dynamic data in real time.
Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and
decision-making. Such systems can organize and present information in specific formats to
accommodate the diverse needs of various users. These systems are known as Online Analytical
Processing (OLAP) systems.
Data Warehouse and the OLTP database are both relational databases. However, the goals of both
these databases are different.
o Operational systems are designed to support high-volume transaction processing, whereas data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
o Operational systems are usually concerned with current data, whereas data warehousing systems are usually concerned with historical data.
o Data within operational systems are mainly updated regularly according to need, whereas data warehousing systems are non-volatile: new data may be added regularly, but once added it is rarely changed.
o Operational systems are designed for real-time business transactions and processes, whereas data warehousing systems are designed for analysis of business measures by subject area, categories, and attributes.
o Operational systems are optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table, whereas data warehousing systems are optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
o Operational systems are optimized for validation of incoming data during transactions and use validation data tables, whereas data warehousing systems are loaded with consistent, valid data and require no real-time validation.
o Operational systems are widely process-oriented, whereas data warehousing systems are widely subject-oriented.
o Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data, whereas data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.
o Relational databases are created for online transaction processing (OLTP), whereas data warehouses are designed for online analytical processing (OLAP).
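To make the contrast concrete, the sketch below (Python with the standard sqlite3 module; the sales table and its values are invented) runs an OLTP-style workload that touches one row at a time and an OLAP-style query that aggregates many rows by subject-area attributes:

```python
import sqlite3

# In-memory database with a single hypothetical sales table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        region      TEXT,
        sale_month  TEXT,
        amount      REAL
    )
""")

# OLTP-style work: insert or look up a single row at a time, per transaction.
conn.execute("INSERT INTO sales (region, sale_month, amount) VALUES (?, ?, ?)",
             ("North", "2023-01", 120.50))
conn.execute("INSERT INTO sales (region, sale_month, amount) VALUES (?, ?, ?)",
             ("North", "2023-02", 75.00))
conn.execute("INSERT INTO sales (region, sale_month, amount) VALUES (?, ?, ?)",
             ("South", "2023-01", 210.00))

# OLAP-style work: scan and aggregate many rows, grouped by subject-area attributes.
rows = conn.execute("""
    SELECT region, sale_month, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region, sale_month
    ORDER BY sale_month, region
""").fetchall()

for region, month, total in rows:
    print(region, month, total)
```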
Data Warehouse Architecture
Production applications such as payroll, accounts payable, product purchasing, and inventory control
are designed for online transaction processing (OLTP). Such applications gather detailed data
from day-to-day operations.
Data warehouse applications are designed to support users' ad hoc data requirements, an activity
recently dubbed online analytical processing (OLAP). These include applications such as
forecasting, profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In
contrast, a warehouse database is updated from operational systems periodically, usually during
off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered,
and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is
populated, it must be restructured: tables de-normalized, data cleansed of errors and redundancies,
and new fields and keys added to reflect the needs of users for sorting, combining, and
summarizing data.
Data warehouses and their architectures vary depending upon the specifics of an organization's
situation.
Operational System
An operational system is a term used in data warehousing to refer to a system that is used to
process the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified,
and file size are examples of very basic document metadata.
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized data is to speed up query performance. The summarized
data is updated continuously as new data is loaded into the warehouse.
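The sketch below (pandas, with invented daily sales data) illustrates the idea of pre-computing lightly and highly summarized tables from detailed data, so that queries can read the aggregates instead of scanning the detail:

```python
import pandas as pd

# Detailed fact data (invented): one row per sale.
detail = pd.DataFrame({
    "sale_date": pd.to_datetime(["2023-01-03", "2023-01-17", "2023-02-02", "2023-02-20"]),
    "region": ["North", "South", "North", "North"],
    "amount": [120.0, 80.0, 200.0, 40.0],
})

# Lightly summarized: monthly totals per region.
lightly_summarized = (detail
                      .groupby([detail["sale_date"].dt.to_period("M"), "region"])["amount"]
                      .sum()
                      .reset_index(name="monthly_amount"))

# Highly summarized: yearly total per region, derived from the lighter summary.
highly_summarized = (lightly_summarized
                     .groupby("region")["monthly_amount"]
                     .sum()
                     .reset_index(name="yearly_amount"))

print(lightly_summarized)
print(highly_summarized)
```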
The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These users interact with the warehouse using end-client access
tools.
Data Warehouse Architecture: With Staging Area
We must clean and process operational data before putting it into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where
data is processed before entering the warehouse) instead.
A staging area simplifies data cleansing and consolidation for operational data coming from
multiple source systems, especially for enterprise data warehouses where all relevant data of an
enterprise is consolidated.
The data warehouse staging area is a temporary location where data from source systems is
copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that can
provide data for reporting and analysis on a section, unit, department, or operation in the
company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyze historical data for purchases and sales or mine
historical information to make predictions about customer behavior.
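A hedged sketch of the idea follows (Python with the standard sqlite3 module; the warehouse table, departments, and figures are invented): a departmental data mart is derived as a subject-specific, summarized subset of the warehouse data.

```python
import sqlite3

# A tiny, invented warehouse table covering several subject areas.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE fact_transactions (
        dept TEXT, region TEXT, period TEXT, amount REAL
    )
""")
warehouse.executemany(
    "INSERT INTO fact_transactions VALUES (?, ?, ?, ?)",
    [("sales", "North", "2023-01", 120.0),
     ("sales", "South", "2023-01", 80.0),
     ("payroll", "North", "2023-01", 5000.0)],
)

# The sales data mart: a subject-specific subset, summarized for that department's reporting.
warehouse.execute("""
    CREATE TABLE mart_sales AS
    SELECT region, period, SUM(amount) AS sales_amount
    FROM fact_transactions
    WHERE dept = 'sales'
    GROUP BY region, period
""")

for row in warehouse.execute("SELECT * FROM mart_sales"):
    print(row)
```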
The following architecture properties are necessary for a data warehouse system:
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume,
which has to be managed and processed, and the number of users' requirements, which have to be
met, progressively increase.
3. Extensibility: The architecture should be able to accommodate new operations and technologies
without redesigning the whole system.
4. Security: Monitoring access is necessary because of the strategic data stored in the data
warehouse.
Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount
of data stored; to reach this goal, it removes data redundancies.
The figure shows that the only layer physically available is the source layer. In this approach, the data
warehouse is virtual. This means that the data warehouse is implemented as a multidimensional
view of operational data created by specific middleware, or an intermediate processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation between
analytical and transactional processing. Analysis queries are directed to operational data after the
middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture
for a data warehouse system, as shown in fig:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is
stored initially in corporate relational databases or legacy databases, or it may come from
information systems outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one
standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools
can combine heterogeneous schemata, extract, transform, cleanse, validate, filter,
and load source data into a data warehouse.
3. Data Warehouse layer: Information is saved in one logically centralized
repository: a data warehouse. The data warehouse can be accessed directly, but it can also
be used as a source for creating data marts, which partially replicate data warehouse
contents and are designed for specific enterprise departments. Meta-data repositories store
information on sources, access procedures, data staging, users, data mart schemas, and so
on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should
feature aggregate data navigators, complex query optimizers, and user-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the
reconciled layer, and the data warehouse layer (containing both data warehouses and data marts).
The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for
the whole enterprise. At the same time, it separates the problems of source data extraction and
integration from those of data warehouse population. In some cases, the reconciled layer is also
used directly to better accomplish some operational tasks, such as producing daily reports that
cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed
external processes periodically so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage
of this structure is the extra storage space used by the redundant reconciled layer. It
also places the analytical tools a little further from being real-time.
Three-Tier Data Warehouse Architecture
A bottom tier consists of the data warehouse server, which is almost always an RDBMS.
It may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by
external consultants) are extracted using application program interfaces known as gateways. A
gateway is provided by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
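As a hedged illustration of the gateway idea, the sketch below assumes the third-party pyodbc package and an invented ODBC data source; the DSN, credentials, and table names are placeholders, not part of the original text. The client program generates SQL, and the gateway passes it to the warehouse server for execution.

```python
import pyodbc  # third-party ODBC bridge; the connection details below are placeholders

# Connect through an ODBC gateway to the warehouse server.
conn = pyodbc.connect("DSN=warehouse_dsn;UID=report_user;PWD=secret")
cursor = conn.cursor()

# The client program generates SQL; the gateway hands it to the server for execution.
cursor.execute(
    "SELECT region, SUM(amount) FROM sales_fact WHERE sale_year = ? GROUP BY region",
    2023,
)
for region, total in cursor.fetchall():
    print(region, total)

conn.close()
```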
A middle tier consists of an OLAP server for fast querying of the data warehouse. The OLAP server is typically implemented using either:
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations
on multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
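To show conceptually what a ROLAP server does, the sketch below (pandas; the regions, quarters, and amounts are invented) maps a small multidimensional view of sales by region and quarter onto ordinary relational-style group-and-aggregate operations:

```python
import pandas as pd

# Invented detail facts with two dimensions (region, quarter) and one measure (amount).
facts = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100.0, 150.0, 80.0, 60.0],
})

# A multidimensional view (a small "cube") expressed as a relational group-by plus pivot.
cube = facts.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum", margins=True)
print(cube)

# A roll-up along the quarter dimension is just another GROUP BY over the same relation.
rollup_by_region = facts.groupby("region")["amount"].sum()
print(rollup_by_region)
```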
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.
The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:
4. Information about the mapping from the operational databases, which includes
source RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business metadata, which include
business terms and definitions, ownership information, etc.
Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow time
windows; performance of the load process should be measured in hundreds of millions of rows
and gigabytes per hour and must not artificially constrain the volume of data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse, including data
conversion, filtering, reformatting, indexing, and metadata update.
Fact-based management demands the highest data quality. The warehouse ensures local
consistency, global consistency, and referential integrity despite "dirty" sources and massive
database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse
RDBMS; large, complex queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few
gigabytes to hundreds of gigabytes and even terabytes.
This tutorial shows how to load data from an Oracle Object Store into a database in
Autonomous Data Warehouse.
It is the third in a series of tutorials for Autonomous Data Warehouse; perform the
tutorials sequentially.
• Using Oracle Machine Learning with Autonomous Data Warehouse Cloud (set of
additional tutorials)
Technical Differences
Modern Data Warehouse
A Modern Data Warehouse is a cloud-based solution that gathers and stores information.
Organizations can process this data to make intelligent decisions. That is why various
organizations use a Modern Data Warehouse to improve their finance, human resources, and
operations business processes: departments need this information
to make smarter decisions.
Once you have acquired the data, you need to load it into the data warehouse. Data engineering uses
pipelines and ETL (extract, transform, load) tools. Using these tools, you can load
that data into the data warehouse, which works much like a factory: data engineering is similar to a truck
bringing raw materials into the factory.
Once the data comes into the factory, you need someone to evaluate the quality of the data. You
then need to steward that data because security and privacy must be considered.
Data governance helps ensure the quality of the info by stewarding, prepping, and cleaning the
data to ensure it is ready for analysis.
Once you prep and clean the data, you can start analysis in the factory, taking that raw
material (data) and turning it into a finished good (business intelligence). For our purposes, we will
use Microsoft Power BI to help you visualize the information by using advanced analytics, KPIs,
and workflow automation. When you are finished, you can see exactly what's going on with
your data.
Level 5: Data Science
Modern Data Warehouse is about more than seeing the information; it’s about using the data to
make smarter decisions. That’s one of the key concepts you should walk away with here today.
There are several different programs to help you leverage the data to your benefit, including:
• AI
• Deep learning
• Machine learning
• Statistical modeling
• Natural language processing (NLP)
Keep in mind that all the algorithms above need data to work successfully. The more data you
provide, the smarter your decisions and the smarter your results. If you want to understand
your reports, it is essential that you leverage AI to get better answers, which leads us back to the Modern
Data Warehouse. Again, it is about more than gathering and storing data. It is about making smart
decisions.