Data Warehousing
ABSTRACT
Data entering the data warehouse comes from the operational environment in almost every
case. Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. A large number of
organizations have found that data warehouse systems are valuable tools in today's
competitive, fast-evolving world. In the last several years, many firms have spent millions of
dollars in building enterprise-wide data warehouses. Many people feel that with competition
mounting in every industry, data warehousing is the latest must-have marketing weapon: a way
to keep customers by learning more about their needs.
So, you may ask, full of intrigue, "What exactly is a data warehouse?" Data warehouses
have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely
speaking, a data warehouse refers to a database that is maintained separately from an organization's
operational database. Data warehouse systems allow for the integration of a variety of application
systems. They support information processing by providing a solid platform of consolidated
historical data for analysis.
Contents
1.5 Architecture
1.9.2 ROLAP
1.9.3 MOLAP
1.13 Data Cleaning
1.14 Transportation
1.16 Application
1.17 Conclusion
1. Introduction to data warehouse
“The data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment”.
Data entering the data warehouse comes from the operational environment in almost every case.
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions. A large number of
organizations have found that data warehouse systems are valuable tools in today's
competitive, fast-evolving world. In the last several years, many firms have spent millions of
dollars in building enterprise-wide data warehouses. Many people feel that with competition
mounting in every industry, data warehousing is the latest must-have marketing weapon: a way
to keep customers by learning more about their needs.
So, you may ask, full of intrigue, "What exactly is a data warehouse?" Data warehouses have been
defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a
data warehouse refers to a database that is maintained separately from an organization's operational
database. Data warehouse systems allow for the integration of a variety of application systems.
They support information processing by providing a solid platform of consolidated historical
data for analysis.
Data warehousing is a more formalised methodology of these techniques. For example, many
sales analysis systems and executive information systems (EIS) get their data from summary
files rather than from operational transaction files. The method
of using summary files instead of operational data is, in essence, what data warehousing is all
about. Some data warehousing tools neglect the importance of modelling and building a data
warehouse and focus only on the storage and retrieval of data. These tools might have strong
analytical facilities but lack the qualities you need to build and maintain a corporate-wide data
warehouse; they belong on the PC rather than the host. Your corporate-wide (or division-wide)
data warehouse needs to be scalable, secure, open and, above all, suitable for publication.
· SCALABLE means that your data warehouse must be able to handle both a growing volume and
variety of data and a growing number of users that access it. For this reason, most companies
prefer to store their corporate-wide data warehouse in a relational database rather than in
multi-dimensional database storage. (You can model your data dimensionally and store it in a
relational database. More about dimensional modelling techniques later.)
· SECURE means that your data warehouse administrator can centrally control who is
allowed to access what data and when.
· OPEN means that the data in your data warehouse is open to a wide range of query and other
front-end tools. For this reason, a relational database should be your first choice for a corporate-
wide data warehouse. The proprietary data storage structures that are used by some data
analysis tools can be fed from this central data warehouse.
The alignment around subject areas affects the design and implementation of the data
found in the data warehouse.
Most prominently, the major subject areas influence the most important part of the key
structure.
Data warehouse data excludes data that will not be used for DSS processing, while
operational application-oriented data contains data to satisfy immediate functional/processing
requirements that may or may not be of use to the DSS analyst.
Inserts, deletes, and changes are done regularly to the operational environment on a
record-by-record basis. In contrast, only two kinds of operations occur in the data warehouse
environment:
o loading of data
o access of data
At the design level, the need to be cautious of the update anomaly is therefore no factor in the data
warehouse, since update of data is not done.
So, at the physical level of design, liberties can be taken to optimize the access of data,
particularly in dealing with the issues of normalization and physical denormalization.
The underlying technology used to run the data warehouse environment need not support
record-by-record update in an on-line mode.
The time horizon for the data warehouse is significantly longer than that of operational
systems.
An operational database contains current-value data. Data warehouse data is nothing more than
a sophisticated series of snapshots, each taken as of some moment in time.
The key structure of operational data may or may not contain some element of time. The key
structure of the data warehouse always contains some element of time.
Data warehouse data represents data over a long-time horizon - from five to ten years.
The time horizon represented for the operational environment is much shorter - from the
current values of today up to sixty to ninety days.
Applications that must perform well and must be available for transaction processing must
carry the minimum amount of data.
Data warehouse data is, for all practical purposes, a long series of snapshots. Operational
data, being accurate as of the moment of access, can be updated as the need arises.
Performance
o special data organization, access methods, and implementation methods are needed to
support operations typical of OLAP.
o Complex OLAP queries would degrade performance for operational transactions.
o Transaction and data integrity, detection and remedy of deadlocks and recovery are not
required.
Function
o missing data: Decision support requires historical data which operational DBs do not
typically maintain.
o data consolidation: DS requires consolidation (aggregation, summarization) of data from
heterogeneous sources: operational DBs, external sources.
o data quality: different sources typically use inconsistent data representations, codes and
formats which must be reconciled.
[Figure: Typical data warehouse architecture. Operational data sources feed a warehouse manager and an operational data store (ODS); detailed data is held in a DBMS, with summarized data in a first tier (relational database) and a second tier (multi-dimensional database) serving data marts; archive/backup data is stored separately; OLAP (online analytical processing) tools and end-user access tools sit on top.]
6 DATA MART
DATA MART → a subset of a data warehouse that supports the requirements of a particular
department or business function (a brief sketch of building such a subset follows the list below).
• The characteristics that differentiate data marts and data warehouses include:
• a data mart focuses only on the requirements of users associated with one department or
business function;
• as data marts contain less data than data warehouses, they are more easily understood
and navigated;
• data marts do not normally contain detailed operational data, unlike data warehouses.
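For illustration, a departmental data mart can be populated as a filtered, lightly summarized subset of the central warehouse. The following is only a sketch; the table and column names (sales_fact, store_dim, mkt_sales_mart) are assumed rather than taken from any particular schema.
-- Hypothetical sketch: build a marketing data mart as a subset of the warehouse.
-- All table and column names are assumed for illustration.
CREATE TABLE mkt_sales_mart AS
SELECT f.time_key,
       f.product_key,
       s.region,
       SUM(f.sales_dollars) AS sales_dollars
FROM   sales_fact f
JOIN   store_dim  s ON s.store_key = f.store_key
WHERE  f.time_key >= DATE '2012-01-01'        -- only the period the department needs
GROUP BY f.time_key, f.product_key, s.region; -- summarized, no detailed operational data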
[Figure: Data warehouse data-flow architecture. Operational data sources pass through a load manager into the warehouse, where the warehouse manager maintains detailed data, lightly and highly summarized data, meta-data, and archive/backup data; a query manager serves OLAP tools and reporting, query, application development, and EIS (executive information system) tools. The flows shown are inflow, upflow, downflow, outflow, and meta-flow.]
• Inflow – The processes associated with the extraction, cleansing, and loading of the data
from the source systems into the data warehouse.
• Upflow – The processes associated with adding value to the data in the warehouse through
summarizing, packaging, and distribution of the data.
• Downflow – The processes associated with archiving and backing up data in the
warehouse.
• Outflow – The processes associated with making the data available to the end-users.
• Meta-flow – The processes associated with the management of the meta-data.
[Figure: Three-tier data warehousing architecture. Operational DBs and semi-structured sources are extracted, transformed, loaded, and refreshed into the data warehouse, which serves data marts (tier 1) and OLAP servers (e.g., ROLAP and MOLAP) that in turn serve analysis, query/reporting, and data mining tools.]
[Figure: Dimensional model with Time, Store, and Product dimensions.]
An expanded view of the model shows three dimensions: Time, Store and Product. Attribute
hierarchies are vertical relationships, while extended attribute characteristics are diagonal.
• The store dimension includes an extended attribute characteristic at a higher level: each
region has a Regional Manager.
• Attribute hierarchies imply aggregation of data: stores roll up into districts; districts into
regions.
Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low
maintenance, simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge
dimension tables a problem
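As an illustration, a typical query against such a star joins the fact table to each dimension once and aggregates up the attribute hierarchies. This is a hedged sketch; the sales_fact, store_dim, product_dim, and time_dim names are assumed rather than taken from the model above.
-- Hypothetical star schema query: sales by region and brand for one quarter.
SELECT st.region,
       p.brand,
       SUM(f.sales_dollars) AS total_sales
FROM   sales_fact  f
JOIN   store_dim   st ON st.store_key  = f.store_key
JOIN   product_dim p  ON p.product_key = f.product_key
JOIN   time_dim    t  ON t.time_key    = f.time_key
WHERE  t.quarter = '2013-Q1'      -- assumed quarter encoding
GROUP BY st.region, p.brand;      -- one join per dimension, aggregation up the hierarchies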
The fact table may also contain partially consolidated data, such as sales dollars for a
region, for a given product for a given period.
Confusion is possible if ALL consolidated data isn't included. For example, if data is
consolidated only to the district level, a query for regional information will bring back no records
and, hence, report no sales activity. For this reason, simple stars MUST contain either
ALL the combinations of aggregated data, or
at least views of every combination.
One approach is to create a multi-part key, identifying each record by the combination of
store/district/region. Using compound keys in a dimension table can cause problems:
It requires three separate metadata definitions to define a single relationship,
which adds to complexity in the design and sluggishness in performance.
Since the fact table must carry all three keys as part of its primary key, addition
or deletion of levels in the hierarchy (such as the addition of "territory" between store and
district) will require physical modification of the fact table, a time-consuming process that limits
flexibility.
Carrying all of the segments of the compound dimensional key in the fact table
increases the size of the crucial fact table index, a real detriment to both performance and
scalability.
The biggest drawback: dimension tables must carry a level indicator for every record and every
query must use it. In the example below, without the level constraint, keys for all stores in the
NORTH region, including aggregates for region and district will be pulled from the fact table,
resulting in error.
Example:
Level is a problem because it causes potential for error. If the query builder, human or
program, forgets about it, perfectly reasonable looking WRONG answers can occur.
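A minimal sketch of the problem, assuming a store_dim table that carries a level_ind column ('STORE', 'DISTRICT' or 'REGION') and a fact table that stores detail rows and aggregate rows together (all names assumed):
-- WRONG: without the level constraint, store rows AND the district/region aggregate
-- rows for the NORTH region are all selected, so sales are counted more than once.
SELECT SUM(f.sales_dollars)
FROM   sales_fact f
JOIN   store_dim  s ON s.store_key = f.store_key
WHERE  s.region = 'NORTH';

-- RIGHT: every query must remember to constrain the level indicator.
SELECT SUM(f.sales_dollars)
FROM   sales_fact f
JOIN   store_dim  s ON s.store_key = f.store_key
WHERE  s.region    = 'NORTH'
AND    s.level_ind = 'STORE';   -- base-level rows only, no aggregates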
One alternative: the FACT CONSTELLATION model (summary tables).
The biggest drawback of the level indicator is that it limits flexibility (we may not know all of
the levels in the attribute hierarchies at first). By limiting ourselves to only certain levels, we
force a physical structure that may change, resulting in higher maintenance costs and more
downtime. The level concept is a useful tool for very controlled data warehouses, that is, those
that either have no ad hoc users or at least only those ad hoc users who are knowledgeable about
the database. Particularly, when the results of queries are pre-formatted reports or extracts to
smaller systems, such as data marts, the drawbacks of the level indicator are not so evident.
The “Fact Constellation” Schema
The chart above is composed of all of the tables from the Classic Star, plus aggregated fact
(summary) tables. For example, the Store dimension is formed of a hierarchy of store-> district
-> region. The District fact table contains ONLY data aggregated by district, therefore there are
no records in the table with STORE_KEY matching any record for the Store dimension at the
store level. Therefore, when we scan the Store dimension table, and select keys that have district
= "Texas," they will only match STORE_KEY in the District fact table when the record is
aggregated for stores in the Texas district. No double (or triple, etc.) counting is possible and the
Level indicator is not needed. These aggregated fact tables can get complicated, though. For
example, we need a District and Region fact table, but what level of detail will they contain
about the product dimension? All of the following:
STORE/PROD DISTRICT/PROD REGION/PROD
STORE/BRAND DISTRICT/BRAND REGION/BRAND
STORE/MANUF DISTRICT/MANUF REGION/MANUF
And these are just the combinations from two dimensions!
In the Fact Constellation, aggregate tables are created separately from the detail; therefore, it is
impossible to pick up, for example, Store detail when querying the District fact table.
Major Advantage: No need for the “Level” indicator in the dimension tables,
since no aggregated data is stored with lower-level detail
Disadvantage: Dimension tables are still very large in some cases, which can slow performance;
front-end must be able to detect existence of aggregate facts, which requires more extensive
metadata
Fact Constellation is a good alternative to the Star, but when dimensions have very high
cardinality, the sub-selects in the dimension tables can be a source of delay.
Another drawback is that multiple SQL statements may be needed to answer a single
question, for example: measure the percent to total of a district to its region. Separate queries are
needed to both the district and region fact tables, then some complex "stitching" together of the
results is needed.
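A hedged sketch of that two-statement pattern, assuming hypothetical district_fact and region_fact tables from the constellation:
-- Query 1: total for the district.
SELECT SUM(sales_dollars) AS district_sales
FROM   district_fact
WHERE  district = 'Texas';

-- Query 2: total for the region containing that district.
SELECT SUM(sales_dollars) AS region_sales
FROM   region_fact
WHERE  region = 'South';

-- The percent-to-total (district_sales / region_sales * 100) must then be
-- "stitched" together by the front-end tool or by a further query.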
Once again, it is easy to see that even with its disadvantages, the Classic Star enjoys the
benefit of simplicity over its alternatives.
An alternative is to normalize the dimension tables by attribute level, with each smaller
dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...
SNOWFLAKE SCHEMA
How does it work? The best way is for the query to be built by understanding which summary
levels exist, finding the proper snowflaked attribute tables, constraining on their keys, and then
selecting from the fact table.
The diagram above is a partial schema - it only shows the "snowflaking" of one dimension. In
fact, the product and time dimensions would be similarly decomposed as follows: Product -
product -> brand -> manufacturer (color and size are extended attribute characteristics of the
attribute "product," not part of the attribute hierarchy)
Time - day -> month -> quarter -> year
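A sketch of a query against the snowflaked store dimension, assuming hypothetical district and region attribute tables and a district-level aggregate fact table:
-- The small, normalized attribute tables are constrained first,
-- then the matching aggregate fact table is selected from.
SELECT d.district_name,
       SUM(f.sales_dollars) AS district_sales
FROM   district_fact f
JOIN   district      d ON d.district_key = f.district_key
JOIN   region        r ON r.region_key   = d.region_key
WHERE  r.region_name = 'South'
GROUP BY d.district_name;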
The process of extracting data from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for extraction, transformation, and loading. However, the
acronym ETL is perhaps too simplistic, because it omits some other phases in the process of
creating data warehouse data from the data sources, such as data cleansing and transportation.
Here, I refer to the entire process for building a data warehouse, including the five phases
mentioned above, as ETL. In addition, after the data warehouse (detailed data) is created, several
data warehousing processes that are relevant to implementing and using the data warehouse are
needed, which include data summarization, data warehouse maintenance, data lineage tracing,
query rewriting, and data mining. In this section, I give a brief description of each process in the
data warehouse.
Extraction is the operation of extracting data from a source system for further use in a data
warehouse environment. This is the first step of the ETL process. After extraction, data can be
transformed and loaded into the data warehouse. The extraction process does not need to involve
complex algebraic database operations, such as joins and aggregate functions. Its focus is
determining which data needs to be extracted and bringing that data into the data warehouse,
specifically into the staging area. However, the data sources might be overly complex
and poorly documented, so designing and creating the extraction process is often the most
time-consuming task in the ETL process, and even in the entire data warehousing process. The
data normally must be extracted not only once, but several times in a periodic manner, to supply
all changed data to the data warehouse and keep it up to date. Thus, data extraction is used not only
in the process of building the data warehouse, but also in the process of maintaining it.
The choice of extraction method depends highly on the data source
situation and the target of the data warehouse. Normally, the data sources cannot be modified,
nor can their performance or availability be adjusted by the extraction process. Very often, entire
documents or tables from the data sources are extracted to the data warehouse or staging area,
so that the extracted data contains the whole information from the data sources. There are two kinds of
logical extraction methods in data warehousing.
Full Extraction
The data is extracted completely from the data sources. As this extraction reflects all the data
currently available on the data source, there is no need to keep track of changes to the data source
since the last successful extraction. The source data will be provided as-is and no additional logic
information (e.g., timestamps) is necessary on the source site.
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in
history will be extracted. This event may be the last time of extraction or a more complex
business event such as the last sale day of a fiscal period. To identify this delta change, there must be
a way to identify all the information that has changed since this specific time event. This
information can be provided either by the source data itself or by a change table, where an
appropriate additional mechanism keeps track of the changes besides the originating transactions.
In most cases, using the latter method means adding extraction logic to the data source.
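A minimal sketch of timestamp-based incremental extraction, assuming the source table carries a last_modified column and the warehouse remembers the time of the previous successful extraction (all names assumed):
-- :last_extract_time is the timestamp of the previous successful run.
INSERT INTO stg_orders             -- staging-area copy of the source table
SELECT o.*
FROM   src_orders o                -- source system table
WHERE  o.last_modified > :last_extract_time;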
To stay independent of the data sources, many data warehouses do not use any change-capture
technique as part of the extraction process and instead use full extraction logic. After a full extraction,
the entire extracted data from the data sources can be compared with the previously extracted data
to identify the changed data. This approach may not have a significant impact on the data source,
but it clearly can place a considerable burden on the data warehouse processes, particularly if the
data volumes are large. Incremental extraction, also called Change Data Capture, is an important
consideration for extraction: it can make the ETL process much more efficient and,
especially, can be used when I select incremental view maintenance as the
maintenance approach of the data warehouse. Unfortunately, for many source systems,
identifying the recently modified data may be difficult or intrusive
to the operation of the data source. Change Data Capture is typically the most challenging
technical issue in data extraction.
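When no change-capture mechanism is available at the source, the changed data can be derived by comparing the current full extract with the previous one using a set difference. The table names below are assumed:
-- Rows that are new or modified since the previous extraction
-- (MINUS is Oracle syntax; EXCEPT in standard SQL).
SELECT * FROM extract_current
MINUS
SELECT * FROM extract_previous;

-- Rows that were deleted at the source.
SELECT * FROM extract_previous
MINUS
SELECT * FROM extract_current;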
Data Cleaning
Extracting data from remote data sources, especially heterogeneous databases, can bring a lot of
erroneous and inconsistent information into the warehouse, e.g., duplicate, alias, abbreviation, and
synonym data. Data warehouses usually face this problem in their role as repositories for
information derived from multiple sources within and across enterprises. Thus, before
transporting data from the staging area to the warehouse, data cleaning is normally required. Data
cleaning, also called data cleansing or scrubbing, is the process of detecting and
removing errors and inconsistencies from data to improve the data quality.
The problems of data cleaning include single-source problems and multi-source problems.
Cleaning data for a single source means cleaning 'dirty' data in each repository. This process involves
formatting and standardizing the source data, such as adding a key to every source record or,
according to the requirements of the warehouse, decomposing some dimension into sub-
dimensions, e.g., decomposing an Address dimension into Location, Street, City, and Zip.
Cleaning data for multiple sources considers several data sources together when doing the cleaning, and
may include merging duplicated databases; for example, the Customer and
Client databases are integrated into a single Customers database. In this process, both the database structure
and the database content may change.
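Two hedged sketches of these cleaning steps, assuming hypothetical staging tables, a customer_seq sequence for surrogate keys, and a simple comma-separated 'street, city, zip' address format:
-- Single-source cleaning: add a surrogate key and decompose the Address column.
INSERT INTO customer_clean (customer_key, street, city, zip)
SELECT customer_seq.NEXTVAL,                                                -- surrogate key
       TRIM(SUBSTR(address, 1, INSTR(address, ',') - 1)),                   -- street
       TRIM(SUBSTR(address, INSTR(address, ',') + 1,
                   INSTR(address, ',', 1, 2) - INSTR(address, ',') - 1)),   -- city
       TRIM(SUBSTR(address, INSTR(address, ',', 1, 2) + 1))                 -- zip
FROM   customer_stage;

-- Multi-source cleaning: integrate Customer and Client into Customers,
-- removing exact duplicates with UNION.
INSERT INTO customers (name, street, city, zip)
SELECT name, street, city, zip FROM customer_src
UNION
SELECT name, street, city, zip FROM client_src;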
All problems relating to data cleaning can be divided into schema-level and
physical (instance) level problems. Schema-level problems are addressed at the schema level by an
improved schema design (schema evolution), schema translation, and schema integration; they
are, of course, also reflected in the physical instances. Physical-level problems, on the other hand,
refer to errors and inconsistencies in the actual data contents which are not visible at the schema
level.
Transportation
Transportation is the process of moving data from one system to
another. It is often one of the simpler portions of the ETL process, and it can be integrated
with other portions of the process, such as extraction and transformation. In a data warehouse
environment, the most common requirements for transportation are moving data from a source
system to a staging database or a data warehouse database, from a staging database to a data
warehouse, and from a data warehouse to a data mart.
Transportation does not often appear in the data warehouse literature, since its functionality can
be split among other data warehousing processes, as is done in this report.
Transforming in Data Warehouse
Data transformation is a critical step in any data warehouse development effort. The data sources
of a data warehouse contain multiple data structures, while the data warehouse has a single kind
of data structure; each heterogeneous piece of data has to be transformed into a uniformly structured
form before being loaded into the data warehouse. Since the data extracted from the data sources is complex,
with arbitrary structures and the problems described above, data transformations are often the
most complex and costly part of the ETL process, and they are usually combined with the data cleaning
process. Two approaches to data transformation are often used in data warehousing.
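Whichever approach is taken, a common transformation is converting inconsistent source codes and formats into the single representation used in the warehouse. The sketch below is hypothetical; the code values and table names are assumed:
-- Unify gender codes from heterogeneous sources ('M'/'F', 'male'/'female', '0'/'1')
-- into the single representation used in the warehouse.
INSERT INTO customer_dim (customer_key, name, gender)
SELECT customer_key,
       UPPER(TRIM(name)),                                  -- standardize formatting
       CASE
         WHEN LOWER(gender_code) IN ('m', 'male', '0')   THEN 'M'
         WHEN LOWER(gender_code) IN ('f', 'female', '1') THEN 'F'
         ELSE 'U'                                          -- unknown
       END
FROM   stg_customers;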
Data loading is the process of moving data from one data system to another, specifically making
the data accessible to the data warehouse. In this sense, data loading repeats the notion
of data transportation. However, data loading appears much more often in the database literature, and
is always connected with data transformation in data warehousing, since before any data
transformation can occur within a database, the raw data must become accessible to the database.
Loading the data into the database is one approach to solving this problem. Furthermore, in
practice, LOAD is an existing command in many commercial database systems and SQL dialects.
SQL*Loader is a tool used to move data from flat files into an Oracle data warehouse.
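A minimal loading sketch, assuming the flat-file data has already been made accessible in a staging table (for flat files themselves, a tool such as SQL*Loader or an external table would perform that step); all table and column names are assumed:
-- Move cleaned, transformed data from the staging area into the warehouse fact table.
INSERT INTO sales_fact (store_key, product_key, time_key, sales_dollars, units_sold)
SELECT store_key, product_key, time_key, sales_dollars, units_sold
FROM   stg_sales;

COMMIT;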
1.2 SECURITY IN DATA WAREHOUSING
A data warehouse is an integrated repository derived from multiple source (operational and legacy)
databases. The data warehouse is created by either replicating the different source data or
transforming them to a new representation. This process involves reading, cleaning, aggregating,
and storing the data in the warehouse model. Software tools are used to access the warehouse
for strategic analysis, decision-making, and marketing types of applications; it can be used, for
example, for inventory control of shelf stock in many department stores.
Medical and human genome researchers can create research data that can be either marketed or
used by a wide range of users. The information and access privileges in the data warehouse should
mimic the constraints of the source data. A recent trend is to create web-based data warehouses, where
multiple users can create components of the warehouse and keep an environment that is open to
third-party access and tools. Given the opportunity, users ask for lots of data in detail. Since
source data can be expensive, its privacy and security must be assured. The idea of adaptive
querying can be used to limit access after some data has been offered to the user. Based on the
user profile, access to warehouse data can be restricted or modified.
In this talk, I will focus on the following ideas that can contribute towards warehouse security.
1. Replication control
Replication can be viewed in a slightly different manner than perceived in the traditional literature.
For example, an old copy can be considered a replica of the current copy of the data, and slightly
out-of-date data can be considered a good substitute for some users. The basic idea is that
the warehouse either keeps different replicas of the same items or creates them dynamically. The
legitimate users get the most consistent and complete copy of the data, while casual users get a weak
replica. Such replicas may be enough to satisfy the user's need but do not provide information that
can be used maliciously or to breach privacy. We have formally defined the equivalence of replicas,
and this notion can be used to create replicas for different users. The replicas may be at one
central site or can be distributed to proxies who may serve the users efficiently. In some cases,
the user may be given the weak replica first and an upgraded replica if willing to pay for it or
deserving of it.
The concept of the warehouse is based on the idea of using summaries and consolidators. This
implies that source data is not available in raw form, which lends itself to ideas that can be used for
security. Some users can be given aggregates only over many records, whereas others can be given
small data instances. The granularity of aggregation can be lowered for genuine users. The
generalization idea can be used to give users high-level information at first, with the lower-level
details given only after the security constraints are satisfied. For example, the user may be
given an approximate answer initially, based on some generalization over the domains of the
database. Inheritance is another notion that allows increasing a user's capability of access:
users can inherit access to related data after having access to some data item.
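One hedged way to express the aggregation idea in SQL is a restricted view for casual users that returns only aggregates computed over a minimum number of underlying records; the table, columns, and threshold below are assumed:
-- View exposed to casual users: aggregates only, and only over groups
-- containing at least 50 underlying records (threshold assumed).
CREATE VIEW sales_public AS
SELECT region,
       product_category,
       SUM(sales_dollars) AS total_sales
FROM   sales_summary            -- assumed denormalized sales table
GROUP BY region, product_category
HAVING COUNT(*) >= 50;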
These concepts can also be used to deliberately mutilate the data. A view may be available to support a particular
query, but the values in the view may be overstated. For security reasons, the quality of views may
depend on the user involved, and a user can be given an exaggerated view of the data. For example,
instead of giving any specific sales figures, views may scale the figures up and give only exaggerated data.
In certain situations, warehouse data can give some misleading information: information which
may be partially incorrect or whose correctness is difficult to verify. For example, a
view of a company's annual report may contain a net profit figure that includes the profit from
sales of properties (not the actual sales of products).
4. Anonymity
Anonymity provides privacy for both the user and the warehouse data. A user does not know the source
warehouse for his query, and the warehouse does not know who the user is or what view the user is
accessing (a view may be constructed from many source databases for that warehouse). Note that a
user must belong to the group of registered users and, similarly, must get data from
only legitimate warehouses. In such cases, encryption is used to secure the connection
between the users and the warehouse so that no outside user (one who has not registered with the
warehouse) can access the warehouse.
A user profile is a representation of the preferences of an individual user. User profiles can help
in authentication and in determining the levels of security for access to warehouse data. The user profile
must describe how and what has to be represented pertaining to the user's information and
security-level authorization needs. The growth in warehouses has made accessing relevant information
in reasonable time difficult, because the large number of sources differ in terms of context
and representation. The warehouse can use data category details in determining access control.
For example, if a user would like to access an unpublished annual company report, the
warehouse server may deny access to it; the alternative is to construct a view that reflects
only the projected sales and profit report. Such a construction of the view may be transparent to the user.
A server can use data given in the profile to decide whether the user should be given access
to associated graphical image data. The server has the option to reduce the resolution or alter the
quality of images before making them available to users.
1.3 APPLICATION OF DATA WAREHOUSE
The value of a decision support system depends on its ability to provide the decision-maker with relevant
information that can be acted upon at an appropriate time. This means that the information needs to be:
Applicable. The information must be current, pertinent to the field of interest and at the correct level of
detail to highlight any potential issues or benefits.
Conclusive. The information must be sufficient for the decision-maker to derive actions that will bring
benefit to the organization.
Timely. The information must be available in a time frame that allows decisions to be effective.
Each of these requirements has implications for the characteristics of the underlying system. To be
effective, a decision support system requires access to all relevant data sources, potentially at a detailed
level. It must also be quick to return both ad-hoc and pre-defined results so that the decision-maker can
investigate to an appropriate level of depth without affecting performance in other areas.
One approach to creating a decision support system is to implement a data warehouse, which integrates
existing sources of data with accessible data analysis techniques. An organization's data sources are
typically departmental or functional databases that have evolved to service specific and localized
requirements. Integrating such highly focused resources for decision support at the enterprise level
requires the addition of other functional capabilities:
Fast query handling. Data sources are normally optimized for data storage and processing, not for their
speed of response to queries.
Increased data depth. Many business conclusions are based on the comparison of current data with
historical data. Data sources are normally focused on the present and so lack this depth.
Business language support. The decision-maker will typically have a background in business or
management, not in database programming. It is important that such a person can request information
using words and not syntax.
A data warehouse meets these requirements by combining the data from the various and disparate
sources into a central repository on which analysis can be performed. This repository is normally a
relational database that provides both the capacity for extra data depth and the support for servicing
queries. Analytical functions are provided by a separate component which is optimized for extracting,
assembling and presenting summary information in response to word-based queries, such as “show me
last week’s sales by region”.
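Behind the scenes, the analytical component translates such a request into a query over the repository. A hedged sketch of what "show me last week's sales by region" might become, with all table, column, and date encodings assumed:
-- Hypothetical translation of the word-based request into SQL.
SELECT st.region,
       SUM(f.sales_dollars) AS last_week_sales
FROM   sales_fact f
JOIN   store_dim  st ON st.store_key = f.store_key
JOIN   time_dim   t  ON t.time_key   = f.time_key
WHERE  t.week_number = 27           -- "last week", assumed encoding
AND    t.year_number = 2013
GROUP BY st.region;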
The diagram illustrates a simplified data warehouse for a manufacturing company. The departmental
and functional interfaces are intended to be indicative only.
The proliferation of data warehouses is highlighted by the "customer loyalty" schemes that are now run by
many leading retailers and airlines. These schemes illustrate the potential of the data warehouse for
"micromarketing" and profitability calculations, but there are other applications of equal value, such as:
Stock control
Product category management
Basket analysis
Fraud analysis
All of these applications offer a direct payback to the customer by facilitating the identification of areas
that require attention. This payback, especially in the fields of fraud analysis and stock control, can be of
high and immediate value.
1.5 CONCLUSION:
1) Most retailers build data warehouses to target specific markets and customer
segments. They are trying to know their customers. It all starts with CDI (customer
data integration); by starting with CDI, the retailers can build the data warehouse around the
customer.
2) On the other side, there are retailers who have no idea who their customers are, or
feel they don't need to know: the world is their customer, and low prices will keep the world
loyal. They use their data warehouse to control inventory and negotiate with suppliers.
The future will bring real-time data warehouse updates, with the ability to give the
retailer a minute-to-minute view of what is going on in a retail location and to take
action either manually or through a condition triggered by the data warehouse data.