Data Warehousing
ABSTRACT
Data entering the data warehouse comes from the operational environment in almost every
case. Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. A large number of
organizations have found that data warehouse systems are valuable tools in today's
competitive, fast-evolving world. In the last several years, many firms have spent millions of
dollars in building enterprise-wide data warehouses. Many people feel that with competition
mounting in every industry, data warehousing is the latest must-have marketing weapon: a way
to keep customers by learning more about their needs.
So, you may ask, full of intrigue, "What exactly is a data warehouse?" Data warehouses
have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely
speaking, a data warehouse refers to a database that is maintained separately from an organization's
operational database. Data warehouse systems allow for the integration of a variety of application
systems. They support information processing by providing a solid platform of consolidated
historical data for analysis.
Contents
1.5 Architecture
1.9.2 ROLAP
1.9.3 MOLAP
1.13 Data Cleaning
1.14 Transportation
1.16 Application
1.17 Conclusion
1. Introduction to data warehouse
“The data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment”.
Data entering the data warehouse comes from the operational environment in almost every case.
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. A large number of
organizations have found that data warehouse systems are valuable tools in today's
competitive, fast-evolving world. In the last several years, many firms have spent millions of
dollars in building enterprise-wide data warehouses. Many people feel that with competition
mounting in every industry, data warehousing is the latest must-have marketing weapon: a way
to keep customers by learning more about their needs.
So, you may ask, full of intrigue, "What exactly is a data warehouse?" Data warehouses have been
defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a
data warehouse refers to a database that is maintained separately from an organization's operational
database. Data warehouse systems allow for the integration of a variety of application systems.
They support information processing by providing a solid platform of consolidated historical
data for analysis.
Data warehousing is a more formalised methodology of these techniques. For example, many
sales analysis systems and executive information systems (EIS) get their data from summary
files rather than from operational transaction files. The method
of using summary files instead of operational data is, in essence, what data warehousing is all
about. Some data warehousing tools neglect the importance of modelling and building a data
warehouse and focus only on the storage and retrieval of data. These tools might have strong
analytical facilities but lack the qualities you need to build and maintain a corporate-wide data
warehouse; they belong on the PC rather than the host. Your corporate-wide (or division-wide)
data warehouse needs to be scalable, secure, open and, above all, suitable for publication.
· SCALABLE means that your data warehouse must be able to handle both a growing volume and
variety of data and a growing number of users that access it. For this reason, most companies
prefer to store their corporate-wide data warehouse in a relational database rather than in
multi-dimensional database storage. (You can model your data dimensionally and store it in a
relational database. More about dimensional modelling techniques later.)
· SECURE means that your data warehouse administrator can centrally control who is
allowed to access what data and when.
· OPEN means that the data in your data warehouse is open to a wide range of query and other
front-end tools. For this reason, a relational database should be your first choice for a corporate-
wide data warehouse. The proprietary data storage structures that are used by some data
analysis tools can be fed from this central data warehouse.
The alignment around subject areas affects the design and implementation of the data
found in the data warehouse.
Most prominently, the major subject areas influence the most important part of the key
structure.
Data warehouse data excludes data that will not be used for DSS processing, while
operational application-oriented data contains data to satisfy immediate functional/processing
requirements that may or may not be of use to the DSS analyst.
Inserts, deletes, and changes are done regularly to the operational environment on a
record-by-record basis. In contrast, only two kinds of operations occur in the data warehouse
environment:
o loading of data
o access of data
At the design level, the need to be cautious of the update anomaly is therefore no factor in the data
warehouse, since update of data is not done.
So, at the physical level of design, liberties can be taken to optimize the access of data,
particularly in dealing with the issues of normalization and physical denormalization.
The underlying technology used to run the data warehouse environment need not support
record-by-record update in an on-line mode.
The time horizon for the data warehouse is significantly longer than that of operational
systems.
An operational database contains current-value data. Data warehouse data is nothing more than
a sophisticated series of snapshots, each taken as of some moment in time.
The key structure of operational data may or may not contain some element of time. The key
structure of the data warehouse always contains some element of time.
Data warehouse data represents data over a long-time horizon - from five to ten years.
The time horizon represented for the operational environment is much shorter - from the
current values of today up to sixty to ninety days.
Applications that must perform well and must be available for transaction processing must
carry the minimum amount of data.
Data warehouse data is, for all practical purposes, a long series of snapshots. Operational
data, being accurate as of the moment of access, can be updated as the need arises.
Performance
o special data organization, access methods, and implementation methods are needed to
support operations typical of OLAP.
o Complex OLAP queries would degrade performance for operational transactions.
o Transaction and data integrity, detection and remedy of deadlocks and recovery are not
required.
Function
o missing data: Decision support requires historical data which operational DBs do not
typically maintain.
o data consolidation: DS requires consolidation (aggregation, summarization) of data from
heterogeneous sources: operational DBs, external sources.
o data quality: different sources typically use inconsistent data representations, codes and
formats which must be reconciled.
[Figure: Typical data warehouse architecture. Operational data sources feed a warehouse manager and an operational data store (ODS); detailed data is held in a DBMS, with summarized data in a first tier (relational database) and a second tier (multi-dimensional database) serving data marts; archive/backup data is stored separately; OLAP (online analytical processing) tools and end-user access tools sit on top.]
6 DATA MART
DATA MART → a subset of a data warehouse that supports the requirements of a particular
department or business function (a brief sketch of building such a subset follows the list below).
• The characteristics that differentiate data marts and data warehouses include:
• a data mart focuses only on the requirements of users associated with one department or
business function;
• as data marts contain less data than data warehouses, they are more easily understood
and navigated;
• data marts do not normally contain detailed operational data, unlike data warehouses.
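For illustration, a departmental data mart can be populated as a filtered, lightly summarized subset of the central warehouse. The following is only a sketch; the table and column names (sales_fact, store_dim, mkt_sales_mart) are assumed rather than taken from any particular schema.
-- Hypothetical sketch: build a marketing data mart as a subset of the warehouse.
-- All table and column names are assumed for illustration.
CREATE TABLE mkt_sales_mart AS
SELECT f.time_key,
       f.product_key,
       s.region,
       SUM(f.sales_dollars) AS sales_dollars
FROM   sales_fact f
JOIN   store_dim  s ON s.store_key = f.store_key
WHERE  f.time_key >= DATE '2012-01-01'        -- only the period the department needs
GROUP BY f.time_key, f.product_key, s.region; -- summarized, no detailed operational data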
[Figure: Data warehouse data-flow architecture. Operational data sources pass through a load manager into the warehouse, where the warehouse manager maintains detailed data, lightly and highly summarized data, meta-data, and archive/backup data; a query manager serves OLAP tools and reporting, query, application development, and EIS (executive information system) tools. The flows shown are inflow, upflow, downflow, outflow, and meta-flow.]
• Inflow – The processes associated with the extraction, cleansing, and loading of the data
from the source systems into the data warehouse.
• Upflow – The processes associated with adding value to the data in the warehouse through
summarizing, packaging, and distribution of the data.
• Downflow – The processes associated with archiving and backing up data in the
warehouse.
• Outflow – The processes associated with making the data available to the end-users.
• Meta-flow – The processes associated with the management of the meta-data.
[Figure: Three-tier data warehousing architecture. Operational DBs and semi-structured sources are extracted, transformed, loaded, and refreshed into the data warehouse, which serves data marts (tier 1) and OLAP servers (e.g., ROLAP and MOLAP) that in turn serve analysis, query/reporting, and data mining tools.]
[Figure: Dimensional model with Time, Store, and Product dimensions.]
An expanded view of the model shows three dimensions: Time, Store and Product. Attribute
hierarchies are vertical relationships, while extended attribute characteristics are diagonal.
• The store dimension includes an extended attribute characteristic at a higher level: each
region has a Regional Manager.
• Attribute hierarchies imply aggregation of data: stores roll up into districts; districts into
regions.
Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low
maintenance, simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge
dimension tables a problem
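As an illustration, a typical query against such a star joins the fact table to each dimension once and aggregates up the attribute hierarchies. This is a hedged sketch; the sales_fact, store_dim, product_dim, and time_dim names are assumed rather than taken from the model above.
-- Hypothetical star schema query: sales by region and brand for one quarter.
SELECT st.region,
       p.brand,
       SUM(f.sales_dollars) AS total_sales
FROM   sales_fact  f
JOIN   store_dim   st ON st.store_key  = f.store_key
JOIN   product_dim p  ON p.product_key = f.product_key
JOIN   time_dim    t  ON t.time_key    = f.time_key
WHERE  t.quarter = '2013-Q1'      -- assumed quarter encoding
GROUP BY st.region, p.brand;      -- one join per dimension, aggregation up the hierarchies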
The fact table may also contain partially consolidated data, such as sales dollars for a
region, for a given product for a given period.
Confusion is possible if ALL consolidated data isn't included. For example, if data is
consolidated only to the district level, a query for regional information will bring back no records
and, hence, report no sales activity. For this reason, simple stars MUST contain either
ALL the combinations of aggregated data, or
at least views of every combination.
One approach is to create a multi-part key, identifying each record by the combination of
store/district/region. Using compound keys in a dimension table can cause problems:
It requires three separate metadata definitions to define a single relationship,
which adds to complexity in the design and sluggishness in performance.
Since the fact table must carry all three keys as part of its primary key, addition
or deletion of levels in the hierarchy (such as the addition of "territory" between store and
district) will require physical modification of the fact table, a time-consuming process that limits
flexibility.
Carrying all of the segments of the compound dimensional key in the fact table
increases the size of the crucial fact table index, a real detriment to both performance and
scalability.
The biggest drawback: dimension tables must carry a level indicator for every record and every
query must use it. In the example below, without the level constraint, keys for all stores in the
NORTH region, including aggregates for region and district will be pulled from the fact table,
resulting in error.
Example:
Level is a problem because it causes potential for error. If the query builder, human or
program, forgets about it, perfectly reasonable looking WRONG answers can occur.
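A minimal sketch of the problem, assuming a store_dim table that carries a level_ind column ('STORE', 'DISTRICT' or 'REGION') and a fact table that stores detail rows and aggregate rows together (all names assumed):
-- WRONG: without the level constraint, store rows AND the district/region aggregate
-- rows for the NORTH region are all selected, so sales are counted more than once.
SELECT SUM(f.sales_dollars)
FROM   sales_fact f
JOIN   store_dim  s ON s.store_key = f.store_key
WHERE  s.region = 'NORTH';

-- RIGHT: every query must remember to constrain the level indicator.
SELECT SUM(f.sales_dollars)
FROM   sales_fact f
JOIN   store_dim  s ON s.store_key = f.store_key
WHERE  s.region    = 'NORTH'
AND    s.level_ind = 'STORE';   -- base-level rows only, no aggregates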
One alternative: the FACT CONSTELLATION model (summary tables).
The biggest drawback of the level indicator is that it limits flexibility (we may not know all of
the levels in the attribute hierarchies at first). By limiting ourselves to only certain levels, we
force a physical structure that may change, resulting in higher maintenance costs and more
downtime. The level concept is a useful tool for very controlled data warehouses, that is, those
that either have no ad hoc users or at least only those ad hoc users who are knowledgeable about
the database. Particularly, when the results of queries are pre-formatted reports or extracts to
smaller systems, such as data marts, the drawbacks of the level indicator are not so evident.
The “Fact Constellation” Schema
The chart above is composed of all of the tables from the Classic Star, plus aggregated fact
(summary) tables. For example, the Store dimension is formed of a hierarchy of store-> district
-> region. The District fact table contains ONLY data aggregated by district, therefore there are
no records in the table with STORE_KEY matching any record for the Store dimension at the
store level. Therefore, when we scan the Store dimension table, and select keys that have district
= "Texas," they will only match STORE_KEY in the District fact table when the record is
aggregated for stores in the Texas district. No double (or triple, etc.) counting is possible and the
Level indicator is not needed. These aggregated fact tables can get complicated, though. For
example, we need a District and Region fact table, but what level of detail will they contain
about the product dimension? All of the following:
STORE/PROD DISTRICT/PROD REGION/PROD
STORE/BRAND DISTRICT/BRAND REGION/BRAND
STORE/MANUF DISTRICT/MANUF REGION/MANUF
And these are just the combinations from two dimensions!
In the Fact Constellation, aggregate tables are created separately from the detail; therefore, it is
impossible to pick up, for example, Store detail when querying the District fact table.
Major Advantage: No need for the “Level” indicator in the dimension tables,
since no aggregated data is stored with lower-level detail
Disadvantage: Dimension tables are still very large in some cases, which can slow performance;
front-end must be able to detect existence of aggregate facts, which requires more extensive
metadata
Fact Constellation is a good alternative to the Star, but when dimensions have very high
cardinality, the sub-selects in the dimension tables can be a source of delay.
Another drawback is that multiple SQL statements may be needed to answer a single
question, for example: measure the percent to total of a district to its region. Separate queries are
needed to both the district and region fact tables, then some complex "stitching" together of the
results is needed.
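A hedged sketch of that two-statement pattern, assuming hypothetical district_fact and region_fact tables from the constellation:
-- Query 1: total for the district.
SELECT SUM(sales_dollars) AS district_sales
FROM   district_fact
WHERE  district = 'Texas';

-- Query 2: total for the region containing that district.
SELECT SUM(sales_dollars) AS region_sales
FROM   region_fact
WHERE  region = 'South';

-- The percent-to-total (district_sales / region_sales * 100) must then be
-- "stitched" together by the front-end tool or by a further query.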
Once again, it is easy to see that even with its disadvantages, the Classic Star enjoys the
benefit of simplicity over its alternatives.
An alternative is to normalize the dimension tables by attribute level, with each smaller
dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...
SNOWFLAKE SCHEMA
How does it work? The best way is for the query to be built by understanding which summary
levels exist, finding the proper snowflaked attribute tables, constraining on their keys, and then
selecting from the fact table.
The diagram above is a partial schema - it only shows the "snowflaking" of one dimension. In
fact, the product and time dimensions would be similarly decomposed as follows: Product -
product -> brand -> manufacturer (color and size are extended attribute characteristics of the
attribute "product," not part of the attribute hierarchy)
Time - day -> month -> quarter -> year
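A sketch of a query against the snowflaked store dimension, assuming hypothetical district and region attribute tables and a district-level aggregate fact table:
-- The small, normalized attribute tables are constrained first,
-- then the matching aggregate fact table is selected from.
SELECT d.district_name,
       SUM(f.sales_dollars) AS district_sales
FROM   district_fact f
JOIN   district      d ON d.district_key = f.district_key
JOIN   region        r ON r.region_key   = d.region_key
WHERE  r.region_name = 'South'
GROUP BY d.district_name;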
The process of extracting data from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for extraction, transformation, and loading. However, the
acronym ETL is perhaps too simplistic, because it omits some other phases in the process of
creating data warehouse data from the data sources, such as data cleansing and transportation.
Here, I refer to the entire process for building a data warehouse, including the five phases
mentioned above, as ETL. In addition, after the data warehouse (detailed data) is created, several
data warehousing processes that are relevant to implementing and using the data warehouse are
needed, which include data summarization, data warehouse maintenance, data lineage tracing,
query rewriting, and data mining. In this section, I give a brief description of each process in the
data warehouse.
Extraction is the operation of extracting data from a source system for further use in a data
warehouse environment. This is the first step of the ETL process. After extraction, data can be
transformed and loaded into the data warehouse. The extraction process does not need to involve
complex algebraic database operations, such as joins and aggregate functions. Its focus is
determining which data needs to be extracted and bringing that data into the data warehouse,
specifically into the staging area. However, the data sources might be overly complex
and poorly documented, so designing and creating the extraction process is often the most
time-consuming task in the ETL process, and even in the entire data warehousing process. The
data normally must be extracted not only once, but several times in a periodic manner, to supply
all changed data to the data warehouse and keep it up to date. Thus, data extraction is used not only
in the process of building the data warehouse, but also in the process of maintaining it.
The choice of extraction method depends highly on the data source
situation and the target of the data warehouse. Normally, the data sources cannot be modified,
nor can their performance or availability be adjusted by the extraction process. Very often, entire
documents or tables from the data sources are extracted to the data warehouse or staging area,
so that the extracted data contains the whole information from the data sources. There are two kinds of
logical extraction methods in data warehousing.
Full Extraction
The data is extracted completely from the data sources. As this extraction reflects all the data
currently available on the data source, there is no need to keep track of changes to the data source
since the last successful extraction. The source data will be provided as-is and no additional logic
information (e.g., timestamps) is necessary on the source site.
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in
history will be extracted. This event may be the last time of extraction or a more complex
business event such as the last sale day of a fiscal period. To identify this delta change, there must be
a way to identify all the information that has changed since this specific time event. This
information can be provided either by the source data itself or by a change table, where an
appropriate additional mechanism keeps track of the changes besides the originating transactions.
In most cases, using the latter method means adding extraction logic to the data source.
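A minimal sketch of timestamp-based incremental extraction, assuming the source table carries a last_modified column and the warehouse remembers the time of the previous successful extraction (all names assumed):
-- :last_extract_time is the timestamp of the previous successful run.
INSERT INTO stg_orders             -- staging-area copy of the source table
SELECT o.*
FROM   src_orders o                -- source system table
WHERE  o.last_modified > :last_extract_time;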
To stay independent of the data sources, many data warehouses do not use any change-capture
technique as part of the extraction process and instead use full extraction logic. After a full extraction,
the entire extracted data from the data sources can be compared with the previously extracted data
to identify the changed data. This approach may not have a significant impact on the data source,
but it clearly can place a considerable burden on the data warehouse processes, particularly if the
data volumes are large. Incremental extraction, also called Change Data Capture, is an important
consideration for extraction: it can make the ETL process much more efficient and,
especially, can be used when I select incremental view maintenance as the
maintenance approach of the data warehouse. Unfortunately, for many source systems,
identifying the recently modified data may be difficult or intrusive
to the operation of the data source. Change Data Capture is typically the most challenging
technical issue in data extraction.
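When no change-capture mechanism is available at the source, the changed data can be derived by comparing the current full extract with the previous one using a set difference. The table names below are assumed:
-- Rows that are new or modified since the previous extraction
-- (MINUS is Oracle syntax; EXCEPT in standard SQL).
SELECT * FROM extract_current
MINUS
SELECT * FROM extract_previous;

-- Rows that were deleted at the source.
SELECT * FROM extract_previous
MINUS
SELECT * FROM extract_current;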
Data Cleaning
Extracting data from remote data sources, especially heterogeneous databases, can bring a lot of
erroneous and inconsistent information into the warehouse, e.g., duplicate, alias, abbreviation, and
synonym data. Data warehouses usually face this problem in their role as repositories for
information derived from multiple sources within and across enterprises. Thus, before
transporting data from the staging area to the warehouse, data cleaning is normally required. Data
cleaning, also called data cleansing or scrubbing, is the process of detecting and
removing errors and inconsistencies from data to improve the data quality.
The problems of data cleaning include single-source problems and multi-source problems.
Cleaning data for a single source means cleaning 'dirty' data in each repository. This process involves
formatting and standardizing the source data, such as adding a key to every source record or,
according to the requirements of the warehouse, decomposing some dimension into sub-
dimensions, e.g., decomposing an Address dimension into Location, Street, City, and Zip.
Cleaning data for multiple sources considers several data sources together when doing the cleaning, and
may include merging duplicated databases; for example, the Customer and
Client databases are integrated into a single Customers database. In this process, both the database structure
and the database content may change.
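Two hedged sketches of these cleaning steps, assuming hypothetical staging tables, a customer_seq sequence for surrogate keys, and a simple comma-separated 'street, city, zip' address format:
-- Single-source cleaning: add a surrogate key and decompose the Address column.
INSERT INTO customer_clean (customer_key, street, city, zip)
SELECT customer_seq.NEXTVAL,                                                -- surrogate key
       TRIM(SUBSTR(address, 1, INSTR(address, ',') - 1)),                   -- street
       TRIM(SUBSTR(address, INSTR(address, ',') + 1,
                   INSTR(address, ',', 1, 2) - INSTR(address, ',') - 1)),   -- city
       TRIM(SUBSTR(address, INSTR(address, ',', 1, 2) + 1))                 -- zip
FROM   customer_stage;

-- Multi-source cleaning: integrate Customer and Client into Customers,
-- removing exact duplicates with UNION.
INSERT INTO customers (name, street, city, zip)
SELECT name, street, city, zip FROM customer_src
UNION
SELECT name, street, city, zip FROM client_src;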
All problems relating to data cleaning can be divided into schema-level and
physical (instance) level problems. Schema-level problems are addressed at the schema level by an
improved schema design (schema evolution), schema translation, and schema integration; they
are, of course, also reflected in the physical instances. Physical-level problems, on the other hand,
refer to errors and inconsistencies in the actual data contents which are not visible at the schema
level.
Transportation
Transportation is the process of moving data from one system to
another. It is often one of the simpler portions of the ETL process, and it can be integrated
with other portions of the process, such as extraction and transformation. In a data warehouse
environment, the most common requirements for transportation are moving data from a source
system to a staging database or a data warehouse database, from a staging database to a data
warehouse, and from a data warehouse to a data mart.
Transportation does not often appear in the data warehouse literature, since its functionality can
be split among other data warehousing processes, as is done in this report.
Transforming in Data Warehouse
Data transformation is a critical step in any data warehouse development effort. The data sources
of a data warehouse contain multiple data structures, while the data warehouse has a single kind
of data structure; each heterogeneous piece of data has to be transformed into a uniformly structured
form before being loaded into the data warehouse. Since the data extracted from the data sources is complex,
with arbitrary structures and the problems described above, data transformations are often the
most complex and costly part of the ETL process, and they are usually combined with the data cleaning
process. Two approaches to data transformation are often used in data warehousing.
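Whichever approach is taken, a common transformation is converting inconsistent source codes and formats into the single representation used in the warehouse. The sketch below is hypothetical; the code values and table names are assumed:
-- Unify gender codes from heterogeneous sources ('M'/'F', 'male'/'female', '0'/'1')
-- into the single representation used in the warehouse.
INSERT INTO customer_dim (customer_key, name, gender)
SELECT customer_key,
       UPPER(TRIM(name)),                                  -- standardize formatting
       CASE
         WHEN LOWER(gender_code) IN ('m', 'male', '0')   THEN 'M'
         WHEN LOWER(gender_code) IN ('f', 'female', '1') THEN 'F'
         ELSE 'U'                                          -- unknown
       END
FROM   stg_customers;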
Data loading is the process of moving data from one data system to another, specifically making
the data accessible to the data warehouse. In this sense, data loading repeats the notion
of data transportation. However, data loading appears much more often in the database literature, and
is always connected with data transformation in data warehousing, since before any data
transformation can occur within a database, the raw data must become accessible to the database.
Loading the data into the database is one approach to solving this problem. Furthermore, in
practice, LOAD is an existing command in many commercial database systems and SQL dialects.
SQL*Loader is a tool used to move data from flat files into an Oracle data warehouse.
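A minimal loading sketch, assuming the flat-file data has already been made accessible in a staging table (for flat files themselves, a tool such as SQL*Loader or an external table would perform that step); all table and column names are assumed:
-- Move cleaned, transformed data from the staging area into the warehouse fact table.
INSERT INTO sales_fact (store_key, product_key, time_key, sales_dollars, units_sold)
SELECT store_key, product_key, time_key, sales_dollars, units_sold
FROM   stg_sales;

COMMIT;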
1.2 SECURITY IN DATA WAREHOUSING
A data warehouse is an integrated repository derived from multiple source (operational and legacy)
databases. The data warehouse is created by either replicating the different source data or
transforming them to a new representation. This process involves reading, cleaning, aggregating,
and storing the data in the warehouse model. Software tools are used to access the warehouse
for strategic analysis, decision-making, and marketing types of applications; it can be used, for
example, for inventory control of shelf stock in many department stores.
Medical and human genome researchers can create research data that can be either marketed or
used by a wide range of users. The information and access privileges in the data warehouse should
mimic the constraints of the source data. A recent trend is to create web-based data warehouses, where
multiple users can create components of the warehouse and keep an environment that is open to
third-party access and tools. Given the opportunity, users ask for lots of data in detail. Since
source data can be expensive, its privacy and security must be assured. The idea of adaptive
querying can be used to limit access after some data has been offered to the user. Based on the
user profile, access to warehouse data can be restricted or modified.
In this talk, I will focus on the following ideas that can contribute towards warehouse security.
1. Replication control
Replication can be viewed in a slightly different manner than perceived in the traditional literature.
For example, an old copy can be considered a replica of the current copy of the data, and slightly
out-of-date data can be considered a good substitute for some users. The basic idea is that
the warehouse either keeps different replicas of the same items or creates them dynamically. The
legitimate users get the most consistent and complete copy of the data, while casual users get a weak
replica. Such replicas may be enough to satisfy the user's need but do not provide information that
can be used maliciously or to breach privacy. We have formally defined the equivalence of replicas,
and this notion can be used to create replicas for different users. The replicas may be at one
central site or can be distributed to proxies who may serve the users efficiently. In some cases,
the user may be given the weak replica first and an upgraded replica if willing to pay for it or
deserving of it.
The concept of the warehouse is based on the idea of using summaries and consolidators. This
implies that source data is not available in raw form, which lends itself to ideas that can be used for
security. Some users can be given aggregates only over many records, whereas others can be given
small data instances. The granularity of aggregation can be lowered for genuine users. The
generalization idea can be used to give users high-level information at first, with the lower-level
details given only after the security constraints are satisfied. For example, the user may be
given an approximate answer initially, based on some generalization over the domains of the
database. Inheritance is another notion that allows increasing a user's capability of access:
users can inherit access to related data after having access to some data item.
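One hedged way to express the aggregation idea in SQL is a restricted view for casual users that returns only aggregates computed over a minimum number of underlying records; the table, columns, and threshold below are assumed:
-- View exposed to casual users: aggregates only, and only over groups
-- containing at least 50 underlying records (threshold assumed).
CREATE VIEW sales_public AS
SELECT region,
       product_category,
       SUM(sales_dollars) AS total_sales
FROM   sales_summary            -- assumed denormalized sales table
GROUP BY region, product_category
HAVING COUNT(*) >= 50;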
These concepts can also be used to deliberately mutilate the data. A view may be available to support a particular
query, but the values in the view may be overstated. For security reasons, the quality of views may
depend on the user involved, and a user can be given an exaggerated view of the data. For example,
instead of giving any specific sales figures, views may scale the figures up and give only exaggerated data.
In certain situations, warehouse data can give some misleading information: information which
may be partially incorrect or whose correctness is difficult to verify. For example, a
view of a company's annual report may contain a net profit figure that includes the profit from
sales of properties (not the actual sales of products).
4. Anonymity
Anonymity provides privacy for both the user and the warehouse data. A user does not know the source
warehouse for his query, and the warehouse does not know who the user is or what view the user is
accessing (a view may be constructed from many source databases for that warehouse). Note that a
user must belong to the group of registered users and, similarly, must get data from
only legitimate warehouses. In such cases, encryption is used to secure the connection
between the users and the warehouse so that no outside user (one who has not registered with the
warehouse) can access the warehouse.
A user profile is a representation of the preferences of an individual user. User profiles can help
in authentication and in determining the levels of security for access to warehouse data. The user profile
must describe how and what has to be represented pertaining to the user's information and
security-level authorization needs. The growth in warehouses has made accessing relevant information
in reasonable time difficult, because the large number of sources differ in terms of context
and representation. The warehouse can use data category details in determining access control.
For example, if a user would like to access an unpublished annual company report, the
warehouse server may deny access to it; the alternative is to construct a view that reflects
only the projected sales and profit report. Such a construction of the view may be transparent to the user.
A server can use data given in the profile to decide whether the user should be given access
to associated graphical image data. The server has the option to reduce the resolution or alter the
quality of images before making them available to users.
1.3 APPLICATION OF DATA WAREHOUSE
The value of a decision support system depends on its ability to provide the decision-maker with relevant
information that can be acted upon at an appropriate time. This means that the information needs to be:
Applicable. The information must be current, pertinent to the field of interest and at the correct level of
detail to highlight any potential issues or benefits.
Conclusive. The information must be sufficient for the decision-maker to derive actions that will bring
benefit to the organization.
Timely. The information must be available in a time frame that allows decisions to be effective.
Each of these requirements has implications for the characteristics of the underlying system. To be
effective, a decision support system requires access to all relevant data sources, potentially at a detailed
level. It must also be quick to return both ad-hoc and pre-defined results so that the decision-maker can
investigate to an appropriate level of depth without affecting performance in other areas.
One approach to creating a decision support system is to implement a data warehouse, which integrates
existing sources of data with accessible data analysis techniques. An organization's data sources are
typically departmental or functional databases that have evolved to service specific and localized
requirements. Integrating such highly focused resources for decision support at the enterprise level
requires the addition of other functional capabilities:
Fast query handling. Data sources are normally optimized for data storage and processing, not for their
speed of response to queries.
Increased data depth. Many business conclusions are based on the comparison of current data with
historical data. Data sources are normally focused on the present and so lack this depth.
Business language support. The decision-maker will typically have a background in business or
management, not in database programming. It is important that such a person can request information
using words and not syntax.
A data warehouse meets these requirements by combining the data from the various and disparate
sources into a central repository on which analysis can be performed. This repository is normally a
relational database that provides both the capacity for extra data depth and the support for servicing
queries. Analytical functions are provided by a separate component which is optimized for extracting,
assembling and presenting summary information in response to word-based queries, such as “show me
last week’s sales by region”.
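Behind the scenes, the analytical component translates such a request into a query over the repository. A hedged sketch of what "show me last week's sales by region" might become, with all table, column, and date encodings assumed:
-- Hypothetical translation of the word-based request into SQL.
SELECT st.region,
       SUM(f.sales_dollars) AS last_week_sales
FROM   sales_fact f
JOIN   store_dim  st ON st.store_key = f.store_key
JOIN   time_dim   t  ON t.time_key   = f.time_key
WHERE  t.week_number = 27           -- "last week", assumed encoding
AND    t.year_number = 2013
GROUP BY st.region;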
The diagram illustrates a simplified data warehouse for a manufacturing company. The departmental
and functional interfaces are intended to be indicative only.
The proliferation of data warehouses is highlighted by the "customer loyalty" schemes that are now run by
many leading retailers and airlines. These schemes illustrate the potential of the data warehouse for
"micromarketing" and profitability calculations, but there are other applications of equal value, such as:
Stock control
Product category management
Basket analysis
Fraud analysis
All of these applications offer a direct payback to the customer by facilitating the identification of areas
that require attention. This payback, especially in the fields of fraud analysis and stock control, can be of
high and immediate value.
1.5 CONCLUSION:
1) Most retailers build data warehouses to target specific markets and customer
segments. They are trying to know their customers. It all starts with CDI (customer
data integration); by starting with CDI, the retailers can build the data warehouse around the
customer.
2) On the other side, there are retailers who have no idea who their customers are, or
feel they don't need to know: the world is their customer, and low prices will keep the world
loyal. They use their data warehouse to control inventory and negotiate with suppliers.
The future will bring real-time data warehouse updates, with the ability to give the
retailer a minute-to-minute view of what is going on in a retail location and to take
action either manually or through a condition triggered by the data warehouse data.