Unit 1

Data Warehouse

Concepts and Terminology


Definition of a Data Warehouse

A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile structured repository of data used for information retrieval and in support of the management decision-making process.
• A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection
of data in support of management's decision making process.
• Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
• Integrated: A data warehouse integrates data from multiple data sources. For example, source
A and source B may have different ways of identifying a product, but in a data warehouse,
there will be only a single way of identifying a product.
• Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older from a data warehouse. This
contrasts with a transaction system, where often only the most recent data is kept. For
example, a transaction system may hold only the most recent address of a customer, whereas a data
warehouse can hold all addresses associated with a customer.
• Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a
data warehouse should never be altered.
Data Warehouse Architecture

• Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture than to discuss the specifics of any one system.

• In general, all data warehouse systems have the following layers:

• Data Source Layer, Data Extraction Layer, Staging Area, ETL Layer, Data Storage Layer, Data Logic Layer, Data Presentation Layer, Metadata Layer, and System Operations Layer.

• Each component is discussed individually below:

• Data Source Layer

• This represents the different data sources that feed data into the data warehouse. The data source can be of any format -- plain text file, relational database, other types of database, Excel file, etc., can all act as a data source.

• Many different types of data can be a data source:

• Operations data -- such as sales data, HR data, product data, inventory data, marketing data, and systems data.
• Web server logs with user browsing data.
• Internal market research data.
• Third-party data, such as census data, demographics data, or survey data.
• All these data sources together form the Data Source Layer.

• Data Extraction Layer

• Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing here, but it is unlikely that any major data transformation occurs.

• Staging Area

• This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.

• ETL Layer

• This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens. The ETL design phase is often the most time-consuming
phase in a data warehousing project, and an ETL tool is often used in this layer.
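A minimal sketch of what the ETL layer does, assuming illustrative field names that are not from any real system: cleansing drops an incomplete row and standardizes codes, and transformation turns transactional order lines into an analytical monthly-sales shape:

```python
# Hedged sketch of the ETL layer: cleansing plus transformation from a
# transactional shape (one row per order line) to an analytical shape
# (sales summed per product per month). Field names are illustrative.
orders = [
    {"order_date": "2023-01-05", "product": "  tv ", "amount": "499.00"},
    {"order_date": "2023-01-20", "product": "TV",    "amount": "499.00"},
    {"order_date": "2023-02-03", "product": "radio", "amount": None},  # dirty row
]

def clean(row):
    if row["amount"] is None:          # data cleansing: drop incomplete rows
        return None
    return {
        "month": row["order_date"][:7],            # derive the time grain
        "product": row["product"].strip().upper(), # standardize product codes
        "amount": float(row["amount"]),
    }

facts = {}
for row in filter(None, map(clean, orders)):       # transform + aggregate
    key = (row["month"], row["product"])
    facts[key] = facts.get(key, 0.0) + row["amount"]

print(facts)  # {('2023-01', 'TV'): 998.0}
```
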

• Data Storage Layer

• This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the
three, two of the three, or all three types.

• Data Logic Layer

• This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but do affect what the report looks like.

• Data Presentation Layer

• This refers to the information that reaches the users. It can be in the form of a tabular / graphical report in a browser, an emailed report that gets automatically generated and sent every day, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool is used in this layer.

• Metadata Layer

• This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the metadata layer. A metadata tool is often used to manage metadata.

• System Operations Layer

• This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
Data Warehouse:
It is an optimized form of the operational database that contains
only relevant information and provides fast access to data.
 Subject oriented
Eg: Data related to all the departments of an
organization
 Integrated:
Different views of data from sources A, B, and C are combined into a
single unified view in the warehouse.
 Time-variant
 Nonvolatile
Data Warehouse Properties

A data warehouse has four properties: it is subject-oriented, integrated, non-volatile, and time-variant.

Subject-Oriented

A data warehouse typically provides a concise view of a particular subject for the decision support system (DSS). For example, OLTP applications for equity plans, shares, insurance, savings, and loans can all feed a single warehouse subject: customer financial information.
Integrated

A data warehouse is constructed by integrating multiple heterogeneous sources, such as OLTP systems, RDBMSs, and flat files. For example, the separate OLTP applications for savings, current accounts, and loans are integrated into a single customer view in the data warehouse.


Time-Variant

Every key structure in the data warehouse contains a time element, either implicitly or explicitly, so the stored data provides information from a historical perspective. For example, a time key such as Jan-97, Feb-97, or Mar-97 associates each record with the month it describes.
Nonvolatile

Typically, data in the data warehouse is not updated or deleted. The operational system performs inserts, reads, updates, and deletes, while the warehouse only loads and reads data.
Data Warehouse
Versus
Operational Database

Characteristics: Data Warehouse / Operational Database

User: Knowledge workers, managers, executives, and analysts / Clerks, clients, and IT professionals
Functions: Long-term informational requirements in support of decision making / Day-to-day operations
No. of records accessed: Millions / Tens
No. of users: Hundreds / Thousands
System orientation: On-Line Analytical Processing (OLAP); market-oriented, used for data analysis / On-Line Transaction Processing (OLTP); customer-oriented, used for transactional query processing
Data content: Manages historical data; provides facilities for summarization and aggregation / Manages current data
Database design: Star, snowflake, or fact constellation model and subject-oriented database design / E-R data model and application-oriented database design
Access pattern: Mostly read-only operations / Read/write operations
Database size: >= TBs / A few GBs
View: Summarized and consolidated / Detailed, flat relational
Unit of work: Complex queries / Short and simple transactions
Focus: Information out / Data in
Priority: High flexibility and end-user autonomy / High performance and high availability
Unit of metric: Query throughput and response time / Transaction throughput
Operations: Lots of scans / Index/hash on primary key
Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)


• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
• View: current, local vs. evolutionary, integrated
• Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

Characteristics: OLTP / OLAP

users: clerk, IT professional / knowledge worker
function: day-to-day operations / decision support
DB design: application-oriented / subject-oriented
data: current, up-to-date; detailed, flat relational; isolated / historical; summarized, multidimensional; integrated, consolidated
usage: repetitive / ad-hoc
access: read/write, index/hash on primary key / lots of scans
unit of work: short, simple transaction / complex query
# records accessed: tens / millions
# users: thousands / hundreds
DB size: 100 MB to GB / 100 GB to TB
metric: transaction throughput / query throughput, response time
Why Separate Data Warehouse?
• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
• Different functions and different data:
• missing data: Decision support requires historical data which
operational DBs do not typically maintain
• data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
• data quality: different sources typically use inconsistent data
representations, codes and formats which have to be
reconciled
3-Tier Data Warehouse
Architecture
A data warehouse adopts a three-tier architecture.

These 3 tiers are:

 Bottom Tier
 Middle Tier
 Top Tier
Data Sources:
All the data related to a business organization is stored
in operational databases, external files, and flat files.

 These sources are application-oriented

Eg: the complete data of an organization, such as training details,
customer details, sales, departments, transactions, and employee
details.
 Data is present here in different formats or host formats
 They contain data that is not well documented
Bottom Tier: Data warehouse
server
The data warehouse server fetches only relevant information, based on data mining.

Eg: customer profile information provided by external consultants.

 Data is fed into the bottom tier by backend tools and utilities.
Backend Tools & Utilities:
Functions performed by backend tools and utilities are:

 Data Extraction
 Data Cleaning
 Data Transformation
 Load
 Refresh
Bottom Tier Contains:
 Data warehouse
 Metadata Repository
 Data Marts
 Monitoring and Administration
Metadata repository:

The metadata repository contains:

The structure of the data warehouse
Data names and definitions
The source of the extracted data
The algorithms used for data cleaning
The sequence of transformations applied to the data
Data related to system performance
Data Marts:

 A subset of the data warehouse containing only small slices of the data
warehouse
Eg: Data pertaining to a single department or focus area
 Two types of data marts:

Dependent: sourced directly from the data warehouse
Independent: sourced from one or more data sources
Monitoring & Administration:

Data Refreshment
Data source synchronization
Disaster recovery
Managing access control and security
Manage data growth, database performance
Controlling the number & range of queries
Limiting the size of data warehouse
The bottom tier (data warehouse server) thus comprises the metadata repository, the data warehouse, and the data marts, fed from the data sources (A, B, C) under the monitoring and administration functions.
Middle Tier: OLAP Server
It presents the users with multidimensional data from the data
warehouse or data marts.
 Typically implemented using one of two models:

ROLAP Model: presents data in relational tables
MOLAP Model: presents data in array-based structures that map directly to the data cube's array structure
Top Tier: Front-end tools
It is the front-end client layer, containing query and reporting tools:
Reporting tools: production reporting tools and report writers
Managed query tools: point-and-click creation of SQL, used for example in building a customer mailing list
 Analysis tools: prepare charts based on analysis
Data mining tools: mine knowledge, discover hidden pieces of information, new correlations, and useful patterns
Data Warehouse component
and
Building Data Warehouse
Building Data Warehouse
In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources
into a staging area
2. Transforming the transactional data into appropriate
form
3. Loading the transformed data into a multidimensional
database
4. Building pre-calculated summary values to speed up
report generation
5. Building (or purchasing) a front-end reporting tool
Extracting the Transactional Data
• A large part of building a DW is pulling data from
various data sources and placing it in a central
storage area.
• In fact, this can be the most difficult step to accomplish
due to the reasons mentioned earlier:
• Most people who worked on the systems in place have moved
on to other jobs.
• Even if they haven't left the company, we still have a lot of
work to do: we need to figure out which database system to use for the
staging area and how to pull data from the various sources into that
area.
Extracting the Transactional Data
Microsoft has come up with an excellent tool for
data extraction: Data Transformation Services (DTS), which is part
of Microsoft SQL Server 7.0 and 2000. It allows us to
import and export data from any OLE DB (Object Linking
and Embedding, Database) or ODBC-compliant database, as long as
we have an appropriate provider.
Transforming Transactional Data
• Most companies have their data spread out in a
number of various database management systems:
MS Access, MS SQL Server, Oracle, Sybase, flat files,
spreadsheets, mail systems and other types of data
stores.
• When building a data warehouse, we need to relate
data from all of these sources and build some type of
a staging area that can handle data extracted from any
of these source systems.
Creating a Dimensional Model
• The third step in building a data warehouse is coming up with a
dimensional model.
• Most modern transactional systems are built using the
relational model. The relational database is highly normalized.
• The relational format is not very efficient when it comes to
building reports with summary and aggregate values.
• The dimensional approach, on the other hand, provides a way
to improve query performance without affecting data integrity.
• However, the query performance improvement comes with a
storage space penalty
Creating a Dimensional Model
• The dimensional model consists of the fact and dimension
tables.
• The fact tables consist of foreign keys to each dimension table,
as well as measures.
• The measures are a factual representation of how well (or how
poorly) your business is doing (for instance, the number of
parts produced per hour or the number of cars rented per
day).
• Dimensions, on the other hand, are what your business users
expect in the reports—the details about the measures.
• For example, the time dimension tells the user that 2000 parts
were produced between 7 a.m. and 7 p.m. on the specific day;
the plant dimension specifies that these parts were produced
by the Northern plant.
Loading the Data
• After you've built a dimensional model, it's time to populate it
with the data in the staging database.
• This step only sounds trivial. It might involve combining several
columns together or splitting one field into several columns.
• You might have to perform several lookups before calculating
certain values for your dimensional model.
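The column splitting and combining described above can be sketched as follows; the staged fields and the target dimension record are hypothetical examples, not from any particular system:

```python
# Sketch of load-time reshaping: splitting one staged field into several
# columns and combining two others into one. Names are hypothetical.
staged = {"full_name": "Ada Lovelace", "city": "London", "country": "UK"}

first, last = staged["full_name"].split(" ", 1)       # split one field
location = f"{staged['city']}, {staged['country']}"   # combine two fields

dim_customer = {"first_name": first, "last_name": last, "location": location}
print(dim_customer)  # {'first_name': 'Ada', 'last_name': 'Lovelace', 'location': 'London, UK'}
```
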
Generating Precalculated Summary
Values
• The next step is generating the precalculated summary
values which are commonly referred to as aggregations.
• This step has been tremendously simplified by SQL Server
Analysis Services (or OLAP Services, as it is referred to in SQL
Server 7.0).
• However, remember that depending on the number of
dimensions , building aggregations can take a long time.
Building (or Purchasing) a Front-End
Reporting Tool
• After building the dimensional database and the aggregations
we can decide how sophisticated our reporting tools need to
be.
• If we just need the drill-down capabilities, and our users have
Microsoft Office 2000 on their desktops, the Pivot Table Service
of Microsoft Excel 2000 will do the job.
• If the reporting needs are more than what Excel can offer, we'll
have to investigate the alternative of building or purchasing a
reporting tool.
• The cost of building a custom reporting (and OLAP) tool will
usually outweigh the purchase price of a third-party tool.
Mapping the Data Warehouse to a
Multiprocessor Architecture

• Introduction and goals


• Database Architectures for Parallel Processing
• Shared-Memory Architecture
• Shared-Disk Architecture
• Shared-Nothing Architecture
Introduction and goals
• The goals of linear performance and scalability can be satisfied
by parallel hardware architectures, parallel operating systems,
and parallel DBMSs.
• Parallel hardware architectures are based on Multi-processor
systems designed as a Shared-memory model, Shared-disk
model or distributed-memory model.
• Parallelism can be achieved in three different ways:
1. Horizontal parallelism (the database is partitioned across
different disks)
2. Vertical parallelism (occurs among the different tasks that make
up a query operation, i.e., scans, joins, sorts)
3. Data partitioning
Database Architectures for Parallel Processing

There are three DBMS software architecture styles for parallel


processing:
Shared-memory Architecture -

multiple processors share the main memory space, as well as


mass storage (e.g. hard disk drives)
Shared Disk Architecture - each node has its own main
memory, but all nodes share mass storage, usually a storage
area network
Shared-nothing Architecture - each node has its own mass
storage as well as main memory.
Shared Memory Architecture

• Multiple PUs share memory.


• Each PU has full access to all shared memory through a
common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory
bus.
• It is simple to implement and provides a single system image;
an RDBMS is typically implemented on an SMP (symmetric multiprocessor).
• A disadvantage of shared memory systems for parallel
processing is as follows:
• Scalability is limited by bus bandwidth and latency, and by
available memory.
Shared Disk Architecture

• Each node consists of one or more PUs and associated


memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• Bandwidth of the high-speed bus limits the number of nodes
(scalability) of the system.
• Parallel processing advantages of shared disk systems are as
follows:
• Shared disk systems permit high availability. All data is
accessible even if one node dies.
Shared Nothing Architecture

• Only one CPU is connected to a given disk.


• Adding more PUs and disks can improve scale up.
• Advantages
• Shared nothing systems provide for incremental growth.
• Failure is local: if one node fails, the others stay up.
• Disadvantages
• More coordination is required.
• More overhead is required for a process working on a disk
belonging to another node.
Multi-Dimensional Data Models
and
Data Cubes

Multi-dimensional Data Models
• Multidimensional data models are defined by facts and dimensions.
• Facts are numerical values, such as total sales in dollars.
• Dimensions are the entities or tables with respect to which an organization keeps records, such as time, item, location, and supplier.
• The three schema models are:

1. Star schema
2. Snowflake schema
3. Fact constellations

Star schema

 A star schema contains a fact table in the middle, surrounded
by a set of dimension tables.
 The fact table contains the keys of the dimension tables (as foreign keys) together with the measures.
 The fact table contains the bulk of the data, without redundancy.
 Each dimension table contains a set of attributes.
 The attributes in a dimension may form a hierarchy.

Example of Star Schema

time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
Sales Fact Table: time_key, item_key, branch_key, location_key, plus the measures units_sold, dollars_sold, avg_sales
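The star schema above can be sketched in SQLite, simplified to just the time and item dimensions (table and column names follow the example; the data values are made up):

```python
import sqlite3

# Hedged sketch of a star schema in SQLite: a sales fact table whose
# foreign keys reference the time and item dimension tables
# (branch and location are omitted for brevity; values are invented).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item_dim,
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Jan', 1997), (2, 'Feb', 1997);
INSERT INTO item_dim VALUES (10, 'TV', 'Acme'), (11, 'Radio', 'Acme');
INSERT INTO sales_fact VALUES (1, 10, 5, 2500.0), (2, 10, 3, 1500.0), (1, 11, 4, 200.0);
""")

# A typical star-join query: total dollars sold per item name
rows = con.execute("""
    SELECT i.item_name, SUM(f.dollars_sold)
    FROM sales_fact f JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY i.item_name ORDER BY i.item_name
""").fetchall()
print(rows)  # [('Radio', 200.0), ('TV', 4000.0)]
```
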
Snowflake schema

• Snowflake schema: a refinement of the star schema in which some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to a snowflake.
• Snowflake schemas are easy to maintain and save storage space.
• The snowflake structure reduces the effectiveness of browsing, since
more joins are needed to execute a query.
• It is less popular than the star schema.

Example of Snowflake Schema

time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, province_or_state, country
Sales Fact Table: time_key, item_key, branch_key, location_key, plus the measures units_sold, dollars_sold, avg_sales

Here the item dimension is normalized into item and supplier tables, and the location dimension into location and city tables.
Fact constellations

• Fact constellations: multiple fact tables share dimension
tables; viewed as a collection of stars, this is therefore also called a galaxy
schema or fact constellation.
• Sophisticated applications require such a schema.

Example of Fact Constellation

time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Sales Fact Table: time_key, item_key, branch_key, location_key, plus the measures units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, plus the measures dollars_cost, units_shipped

The two fact tables share the time, item, and location dimension tables.
A Concept Hierarchy: Dimension (location)

all: all
region: Europe ... North_America
country: Germany ... Spain, Canada ... Mexico
city: Frankfurt ... Vancouver ... Toronto
office: L. Chan ... M. Wind

Data Cubes

• A data cube allows data to be modeled and viewed in
multiple dimensions. It is defined by dimensions and facts.
• Dimensions are perspectives or entities in the database,
and the cells in the data cube represent the measure of
interest.
• Users of decision support systems often see data in the
form of data cubes.
• The cube is used to represent data along some measure
of interest.
• Although called a "cube", it can be 2-dimensional, 3-dimensional,
or higher-dimensional.
Data Cube Examples

• AllElectronics may create a sales data warehouse in order to
keep records of the store's sales with respect to the
dimensions time, item, branch, and location. These
dimensions allow the store to keep track of things like
monthly sales of items and the branches and locations at
which the items were sold.

[Figure: A 3-D view of sales data for AllElectronics, according to the dimensions time, item, and location. The measure displayed is dollars_sold (in thousands).]

[Figure: A 3-D data cube representation of the same data, according to the dimensions time, item, and location.]

[Figure: A 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. For improved readability, only some of the cube values are shown.]
Cuboids Corresponding to the Cube

For the dimensions product, date, and country, the cube consists of:
0-D (apex) cuboid: ()
1-D cuboids: (product), (date), (country)
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)
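The lattice of cuboids can be enumerated programmatically: for n dimensions there are 2^n cuboids, from the 0-D apex to the n-D base cuboid. A small sketch:

```python
from itertools import combinations

# Enumerate every cuboid (every subset of the dimensions) of a data cube.
def cuboids(dims):
    for k in range(len(dims) + 1):
        for combo in combinations(dims, k):
            yield combo

dims = ("product", "date", "country")
lattice = list(cuboids(dims))
print(len(lattice))   # 8 cuboids for 3 dimensions (2**3)
print(lattice[0])     # () is the apex cuboid
print(lattice[-1])    # ('product', 'date', 'country') is the base cuboid
```
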
4-D Data Cube

• A lattice of cuboids makes up a 4-D data cube for the dimensions
time, item, location, and supplier.
• Each cuboid represents a different degree of summarization.
Cube Operation

• Transform it into an SQL-like language (with the new cube by operator,
introduced by Gray et al. '96):

SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year

• This needs to compute the following GROUP BYs:

(item, city, year),
(item, city), (item, year), (city, year),
(item), (city), (year),
()
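What CUBE BY computes can be sketched in plain Python: one GROUP BY aggregation per subset of the dimensions (the sample rows below are made up for illustration):

```python
from itertools import combinations

# Hedged sketch of the CUBE BY operator: compute a GROUP BY SUM for every
# subset of (item, city, year). The sample data is invented.
sales = [
    {"item": "TV",    "city": "NY", "year": 1997, "amount": 100},
    {"item": "TV",    "city": "LA", "year": 1997, "amount": 150},
    {"item": "Radio", "city": "NY", "year": 1998, "amount": 50},
]

def cube(rows, dims, measure):
    result = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):   # one cuboid per subset
            agg = {}
            for r in rows:
                key = tuple(r[d] for d in group)
                agg[key] = agg.get(key, 0) + r[measure]
            result[group] = agg
    return result

by_group = cube(sales, ("item", "city", "year"), "amount")
print(by_group[()])         # {(): 300} -- the apex cuboid (grand total)
print(by_group[("item",)])  # {('TV',): 250, ('Radio',): 50}
```
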


Process Architecture
• There are four major processes that contribute to a data warehouse −
• Extract and load the data.
• Cleaning and transforming the data.
• Backup and archive the data.
• Managing queries and directing them to the appropriate data sources.
Extract and Load Process
Controlling the Process:
• Controlling the process involves determining when to start data
extraction and the consistency check on data.
• Controlling process ensures that the tools, the logic modules,
and the programs are executed in correct sequence and at
correct time.
When to Initiate Extract:
• Data needs to be in a consistent state when it is extracted, i.e.,
the data warehouse should represent a single, consistent
version of the information to the user.
Loading the Data:
• After extracting the data, it is loaded into a temporary data
store where it is cleaned up and made consistent.
Clean and Transform Process
Clean and Transform the Loaded Data into a Structure:
• Cleaning and transforming the loaded data helps speed up the queries. It
can be done by making the data consistent −
• within itself.
• with other data within the same data source.
• with the data in other source systems.
• with the existing data present in the warehouse.
Partition the Data:
• It will optimize the hardware performance and simplify the management
of data warehouse. Here we partition each fact table into multiple
separate partitions.
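The partitioning step above can be sketched as follows, assuming month is the partition key (a common but not mandatory choice; the rows are invented):

```python
# Sketch of fact-table partitioning: split one fact table into per-month
# partitions. The month partition key and the data are assumptions.
facts = [
    {"date": "2023-01-05", "amount": 10},
    {"date": "2023-01-20", "amount": 20},
    {"date": "2023-02-01", "amount": 30},
]

partitions = {}
for row in facts:
    # "YYYY-MM" prefix of the date acts as the partition key
    partitions.setdefault(row["date"][:7], []).append(row)

print(sorted(partitions))          # ['2023-01', '2023-02']
print(len(partitions["2023-01"]))  # 2 rows land in the January partition
```
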
Backup and Archive Process
Backup and Archive the Data:
• In order to recover the data in the event of data loss, software
failure, or hardware failure, it is necessary to keep regular
backups.
• Archiving involves removing the old data from the system in a
format that allows it to be quickly restored whenever required.
Query Management Process:
• This process performs the following functions −
• manages the queries.
• helps speed up the execution time of queries.
• directs the queries to their most effective data sources.
• ensures that all the system sources are used in the most
effective way.
• monitors actual query profiles.
Thank You
