Lecture 6
Instructor: Mr. Sharjeel Ahmed
Slide Elements
• The architectural components
• Data Warehouse Architecture
• Distinguishing Characteristics
• Architectural Framework
• Technical Architecture
DATA WAREHOUSE ARCHITECTURE
Architecture: Definitions
• The structure that brings all the components of a data warehouse
together is known as the architecture.
• In your data warehouse, architecture includes a number of factors.
• Primarily, it includes the integrated data that is the centerpiece.
• The architecture includes everything that is needed to prepare the
data and store it.
• It also includes all the means for delivering information from your data
warehouse.
• The architecture is further composed of the rules, procedures, and
functions that enable your data warehouse to work and fulfill the
business requirements.
• Finally, the architecture is made up of the technology that empowers
your data warehouse.
General Purpose of The Architecture
• The architecture provides the overall framework for developing and
deploying your data warehouse.
• It is a comprehensive blueprint.
• The architecture is not the set of tools needed to perform functions and
provide services. When we refer to the data extraction function within
one of the architectural components, we are simply mentioning the
function itself and the various tasks associated with it. Also, we are
relating the data store for the staging area to the data extraction
function because extracted data is moved to the staging area. Where
do the tools fit in? Tools are the means to implement the architecture.
• Architecture comes first and the tools follow.
Technical Architecture (Cont. )
• Let us now move on to consider the technical architecture in each of
the three major areas of the data warehouse:
1. Data Acquisition
2. Data Storage
3. Information Delivery
Data Flow: The data flow begins at the data sources and pauses at the
staging area. After transformation and integration, the data is ready for
loading into the data warehouse repository.
Data Acquisition (Cont. )
Data Sources: For the majority of data warehouses, the primary data
source consists of the enterprise’s operational systems.
• Many operational systems at several enterprises are still legacy systems
that reside on hierarchical or network databases. Use the appropriate
language of the particular DBMS to extract data.
• Some more recent operational systems run on the client/server
architecture. Usually, these systems are supported by relational DBMSs.
Here you use an SQL-based language for extracting data.
• A large number of companies have adopted ERP (enterprise resource
planning) systems. ERP data sources provide an advantage in that the
data from these sources is already consolidated and integrated. There
could, however, be a few drawbacks to using ERP. You will have to use the
ERP vendor’s proprietary tool for data extraction. Also, most of the ERP
offerings contain very large numbers of source data tables.
• For data from outside sources, you will have to create temporary files to
hold the data received. After reformatting and rearranging data elements,
you will have to move the data to the staging area.
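As a minimal sketch of the SQL-based extraction described above for relational sources, the following uses an in-memory SQLite database to stand in for an operational system. The `orders` table, its columns, and the function `extract_to_flat_file` are hypothetical names chosen only for illustration; a real source system would have its own schema and connection details.

```python
import csv
import io
import sqlite3


def extract_to_flat_file(conn, query, out):
    """Run an extraction query and write the result rows to a CSV flat file."""
    cursor = conn.execute(query)
    writer = csv.writer(out, lineterminator="\n")
    # The header row comes from the column names of the result set.
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical operational system: a tiny orders table.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL, status TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "Acme", 120.0, "SHIPPED"),
        (2, "Birch", 75.5, "OPEN"),
        (3, "Acme", 30.0, "SHIPPED"),
    ],
)

# The WHERE clause plays the role of the extraction filter.
flat_file = io.StringIO()
rows_extracted = extract_to_flat_file(
    source,
    "SELECT order_id, customer, amount FROM orders "
    "WHERE status = 'SHIPPED' ORDER BY order_id",
    flat_file,
)
print(rows_extracted)
print(flat_file.getvalue())
```

Writing to a file-like object keeps the sketch self-contained; in practice the output would be a temporary file on its way to the staging area.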
Intermediary Data Stores: As data gets extracted from data sources,
it moves through temporary files.
• Sometimes, extracts of homogeneous data from several source
applications are pulled into separate temporary files and then merged
into another temporary file before moving it to the staging area.
• The opposite process is also common. From each application, one or
two large flat files are created and then divided into smaller files and
merged appropriately before moving the data to the staging area.
• The general practice is to use flat files to extract data from
operational systems.
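The first pattern above (separate extracts of homogeneous data merged into one temporary file) could be sketched as follows. This assumes every extract shares the same column layout; the source names, the sample rows, and the added `source_system` column are illustrative assumptions, not part of the lecture material.

```python
import csv
import io


def merge_extracts(extracts, out):
    """Merge same-layout CSV extracts into one file, tagging rows by source."""
    writer = None
    total = 0
    for source_name, handle in extracts.items():
        reader = csv.DictReader(handle)
        for row in reader:
            if writer is None:
                # Build the output layout from the first extract that has rows.
                fields = ["source_system"] + reader.fieldnames
                writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
                writer.writeheader()
            row["source_system"] = source_name
            writer.writerow(row)
            total += 1
    return total


# Two hypothetical temporary extract files with the same layout.
extracts = {
    "billing": io.StringIO("customer_id,name\n101,Acme\n102,Birch\n"),
    "crm": io.StringIO("customer_id,name\n102,Birch\n103,Cedar\n"),
}
merged = io.StringIO()
merged_count = merge_extracts(extracts, merged)
print(merged.getvalue())
```

Tagging each row with its source keeps the merged file traceable; de-duplicating customer 102, who appears in both extracts, is left to the transformation step.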
Staging Area:
• This is the place where all the extracted data is put together and
prepared for loading into the data warehouse. The staging area may
contain data at the lowest grain to populate tables containing
business measurements. Staging area data repositories are relational
databases containing the fully integrated and cleansed data.
• Functions and Services: This is a general list. It does not indicate
the extent or complexity of each function or service:
1. Data Extraction
• Select data sources and determine types of filters to be applied
• Generate automatic extract files from operational systems using
replication and other techniques
• Create intermediary files to store selected data to be merged later
• Transport extracted files from multiple platforms
• Provide automated job control services for creating extract files
• Reformat input from outside sources
• Reformat input from departmental data files, databases, and
spreadsheets
• Generate common application code for data extraction
• Resolve inconsistencies for common data elements from multiple
sources
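One of the extraction tasks listed above, resolving inconsistencies for common data elements from multiple sources, can be illustrated with a small lookup-based sketch. The source names, the code values, and the `standardize` function are all hypothetical; real mappings would come from analysis of the actual source systems.

```python
# Hypothetical per-source code mappings for a shared "gender" data element.
CODE_MAPS = {
    "billing": {"M": "MALE", "F": "FEMALE"},
    "crm": {"1": "MALE", "2": "FEMALE"},
}


def standardize(source, value, default="UNKNOWN"):
    """Translate a source-specific code into the warehouse's standard value."""
    cleaned = value.strip().upper()
    return CODE_MAPS[source].get(cleaned, default)


standardized = [
    standardize("billing", " m "),
    standardize("crm", "2"),
    standardize("crm", "9"),  # unmapped code falls back to the default
]
print(standardized)
```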
2. Data Transformation
• Map input data to the data structures of the data warehouse repository.
• Clean data, de-duplicate, and merge/purge.
• De-normalize extracted data structures as required by the
dimensional model of the data warehouse.
• Convert data types.
• Calculate and derive attribute values.
• Check for referential integrity.
• Aggregate data as needed.
• Resolve missing values.
• Consolidate and integrate data
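Several of the transformation tasks listed above can be combined in one pass over the extracted rows. The sketch below is only illustrative: the field names, the sample data, and the policy of defaulting a missing quantity to 0 are assumptions made for the example, not rules from the lecture.

```python
from datetime import date


def transform(rows, valid_product_ids):
    """Clean, de-duplicate, convert, and derive values for raw extract rows."""
    seen = set()
    accepted, rejected = [], []
    for raw in rows:
        order_id = int(raw["order_id"])  # convert data types
        if order_id in seen:  # de-duplicate on the business key
            continue
        seen.add(order_id)
        product_id = raw["product_id"].strip().upper()  # clean
        if product_id not in valid_product_ids:  # referential integrity check
            rejected.append(order_id)
            continue
        # Assumed policy: a missing quantity is resolved to 0.
        quantity = int(raw["quantity"] or 0)
        unit_price = float(raw["unit_price"])
        accepted.append({
            "order_id": order_id,
            "product_id": product_id,
            "order_date": date.fromisoformat(raw["order_date"]),
            "quantity": quantity,
            "sale_amount": round(quantity * unit_price, 2),  # derived attribute
        })
    return accepted, rejected


raw_rows = [
    {"order_id": "1", "product_id": " p10 ", "order_date": "2024-01-05",
     "quantity": "2", "unit_price": "9.99"},
    {"order_id": "1", "product_id": " p10 ", "order_date": "2024-01-05",
     "quantity": "2", "unit_price": "9.99"},  # duplicate row
    {"order_id": "2", "product_id": "P99", "order_date": "2024-01-05",
     "quantity": "1", "unit_price": "5.00"},  # unknown product
    {"order_id": "3", "product_id": "P20", "order_date": "2024-01-06",
     "quantity": "", "unit_price": "4.50"},  # missing quantity
]
accepted, rejected = transform(raw_rows, valid_product_ids={"P10", "P20"})
print(accepted)
print(rejected)
```

Rows that fail the referential integrity check are collected rather than silently dropped, so they can be reviewed or corrected before the next load.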
3. Data Staging
• Provide backup and recovery.
• Sort and merge files.
• Create files as input to make changes to dimension tables
• If data staging storage is a relational database, create and populate
database.
• Preserve audit trail to relate each data item in the data warehouse to
input source.
• Resolve and create primary and foreign keys for load tables.
• Consolidate datasets and create flat files for loading through DBMS
utilities.
• If staging area storage is a relational database, extract load files
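Two of the staging tasks above, resolving keys for load tables and preserving an audit trail, might look like the following in miniature. The surrogate-key scheme (sequential integers), the field names, and the file name `sales_extract.csv` are assumptions for illustration only.

```python
def assign_surrogate_keys(dimension_rows, key_map, next_key=1):
    """Give each natural key a stable integer surrogate key, reusing known ones."""
    for row in dimension_rows:
        natural = row["customer_code"]
        if natural not in key_map:
            key_map[natural] = next_key
            next_key += 1
        row["customer_key"] = key_map[natural]
    return next_key


def resolve_foreign_keys(fact_rows, key_map, source_file):
    """Replace natural keys in fact rows with surrogate keys and record lineage."""
    load_rows = []
    for line_no, row in enumerate(fact_rows, start=1):
        load_rows.append({
            "customer_key": key_map[row["customer_code"]],
            "amount": row["amount"],
            # Audit trail: the input file and row this load row came from.
            "audit_source": f"{source_file}:{line_no}",
        })
    return load_rows


key_map = {}
customers = [{"customer_code": "C-ACME"}, {"customer_code": "C-BIRCH"}]
assign_surrogate_keys(customers, key_map)

sales = [
    {"customer_code": "C-BIRCH", "amount": 75.5},
    {"customer_code": "C-ACME", "amount": 120.0},
]
load_rows = resolve_foreign_keys(sales, key_map, "sales_extract.csv")
print(load_rows)
```

Keeping the key map between runs is what lets later incremental loads reuse the same surrogate key for a customer that was already loaded.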
Data Storage (Cont. )
Data Storage: This area covers the process of loading the data from
the staging area into the data warehouse repository.
Data Flow: For data storage, the data flow begins at the data staging
area and ends at the data warehouse repository.
• If the data warehouse is an enterprise-wide data warehouse being
built in a top-down fashion, then there could be movements of data
from the enterprise-wide data warehouse repository to the
repositories of the dependent data marts.
• Alternatively, if the data warehouse is being built in a bottom-up manner,
then the data movements stop with the appropriate conformed data
marts.
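For the top-down case described above, the movement from the enterprise-wide repository to a dependent data mart can be sketched with SQLite. Both tables live in one in-memory database here purely for convenience; the table names `edw_sales` and `mart_sales`, the columns, and the choice to summarize by region and product are hypothetical.

```python
import sqlite3


def populate_dependent_mart(conn, region):
    """Copy one region's summarized sales from the warehouse into the data mart."""
    with conn:  # one transaction: clear the region's old rows, then reload them
        conn.execute("DELETE FROM mart_sales WHERE region = ?", (region,))
        conn.execute(
            "INSERT INTO mart_sales (region, product, total_amount) "
            "SELECT region, product, SUM(amount) FROM edw_sales "
            "WHERE region = ? GROUP BY region, product",
            (region,),
        )


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edw_sales (region TEXT, product TEXT, amount REAL)")
db.execute("CREATE TABLE mart_sales (region TEXT, product TEXT, total_amount REAL)")
db.executemany(
    "INSERT INTO edw_sales VALUES (?, ?, ?)",
    [
        ("EAST", "P10", 100.0),
        ("EAST", "P10", 25.0),
        ("EAST", "P20", 40.0),
        ("WEST", "P10", 60.0),
    ],
)

populate_dependent_mart(db, "EAST")
mart_rows = db.execute(
    "SELECT product, total_amount FROM mart_sales ORDER BY product"
).fetchall()
print(mart_rows)
```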
Data Groups: Prepared data waiting in the data staging area falls into
two groups.
• The first group is the set of files or tables containing data for a full
refresh. This group of data is usually meant for the initial loading of
the data warehouse. Occasionally, some data warehouse tables may
be refreshed fully.
• The other group of data is the set of files or tables containing ongoing
incremental loads. Most of these relate to nightly loads. Some
incremental loads of dimension data may be performed at less
frequent intervals.
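The difference between the two data groups can be shown with a small SQLite sketch. The `sales_fact` table and its columns are hypothetical, and `INSERT OR IGNORE` is SQLite-specific syntax used here as one simple way to skip rows that are already loaded; other DBMSs offer different mechanisms.

```python
import sqlite3


def full_refresh(conn, rows):
    """Replace the entire table contents, as in an initial load."""
    with conn:
        conn.execute("DELETE FROM sales_fact")
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)


def incremental_load(conn, rows):
    """Add only rows whose sale_id is not yet in the table."""
    with conn:
        conn.executemany("INSERT OR IGNORE INTO sales_fact VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
)

# Initial load: full refresh.
full_refresh(warehouse, [(1, "2024-01-01", 100.0), (2, "2024-01-01", 50.0)])

# Nightly load: sale 2 is already present and is skipped; sale 3 is new.
incremental_load(warehouse, [(2, "2024-01-01", 50.0), (3, "2024-01-02", 80.0)])

row_count = warehouse.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
total_amount = warehouse.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
print(row_count, total_amount)
```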
• Data Flow: For information delivery, the data flow begins at the
enterprise-wide data warehouse and the dependent data marts when
the design is based on the top-down technique.
• When the design follows the bottom-up method, the data flow starts
at the set of conformed data marts.
• Data transformed into information flows to the user desktops during
query sessions.
Information Delivery (Cont. )
Service Locations: In your information delivery component, you may
provide query services from the user desktop, from an application
server, or from the database itself. This will be one of the critical
decisions for your architecture design.
Data Store: For information delivery, you may consider the following
intermediary data stores:
• Proprietary temporary stores to hold results of individual queries and
reports for repeated use
• Data stores for standard reporting
• Proprietary multidimensional databases
Functions and Services: This is a general list. It does not indicate the
extent or complexity of each function or service.
• Provide security to control information access
• Monitor user access to improve service and for future enhancements
• Allow users to browse data warehouse content
• Simplify access by hiding internal complexities of storage from users
• Automatically reformat queries for optimal execution
• Enable queries to be aware of aggregate tables for faster results
• Govern queries and control runaway queries
• Provide self-service report generation for users, consisting of a
variety of flexible options to create, schedule, and run reports
• Store result sets of queries and reports for future use
• Provide multiple levels of data granularity
• Provide event triggers to monitor data loading
• Make provision for the users to perform complex analysis through
online analytical processing (OLAP)
• Enable data feeds to downstream, specialized decision support
systems such as EIS and data mining
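Of the services listed above, making queries aware of aggregate tables lends itself to a short sketch. The idea shown is to route a query to the smallest table that still contains every dimension the query asks for. The table catalog, its row counts, and the function `choose_table` are hypothetical, and a real aggregate navigator would also need to account for dimension hierarchies and available measures.

```python
# Hypothetical catalog: the base fact table plus one pre-built aggregate.
TABLES = [
    {
        "name": "sales_fact",
        "dims": {"day", "month", "product", "store"},
        "row_count": 1_000_000,
    },
    {
        "name": "sales_by_month_product",
        "dims": {"month", "product"},
        "row_count": 12_000,
    },
]


def choose_table(requested_dims, tables):
    """Pick the smallest table whose dimensions cover the requested ones."""
    candidates = [t for t in tables if set(requested_dims) <= t["dims"]]
    if not candidates:
        raise ValueError(f"no table covers dimensions {sorted(requested_dims)}")
    return min(candidates, key=lambda t: t["row_count"])["name"]


# A month-by-product query can be answered from the aggregate table.
summary_table = choose_table({"month", "product"}, TABLES)

# A query by store needs detail that only the base fact table holds.
detail_table = choose_table({"store", "month"}, TABLES)

print(summary_table, detail_table)
```

Because the routing happens behind the query interface, users do not need to know that the aggregate table exists, which is also in keeping with hiding the internal complexities of storage.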