Chapter 12 - Data Warehousing and Online Analytical Processing


3/6/2011 Chapter 12 - Data Warehousing and Onl…

A data warehouse is often used as the basis for a decision-support system (also referred to from an analytical
perspective as a business intelligence system). It is designed to overcome some of the problems encountered
when an organization attempts to perform strategic analysis using the same database that is used to perform
online transaction processing (OLTP).

A typical OLTP system is characterized by having large numbers of concurrent users actively adding and
modifying data. The database represents the state of a particular business function at a specific point in time,
such as an airline reservation system. However, the large volume of data maintained in many OLTP systems can
overwhelm an organization. As databases grow larger with more complex data, response time can deteriorate
quickly due to competition for available resources. A typical OLTP system has many users adding new data to
the database while fewer users generate reports from the database. As the volume of data increases, reports
take longer to generate.

As organizations collect increasing volumes of data by using OLTP database systems, the need to analyze data
becomes more acute. Typically, OLTP systems are designed specifically to manage transaction processing and to
minimize disk storage requirements by storing data in a series of related, normalized tables. However, when users
need to analyze their data, a number of problems often prevent the data from being used effectively:

Users may not understand the complex relationships among the tables, and therefore cannot generate ad
hoc queries.

Application databases may be segmented across multiple servers, making it difficult for users to find the
tables in the first place.

Security restrictions may prevent users from accessing the detail data they need.

Database administrators may prohibit ad hoc querying of OLTP systems, to prevent analytical users from
running queries that could slow the performance of mission-critical production databases.

By copying an OLTP system to a reporting server on a regularly scheduled basis, an organization can improve
response time for reports and queries. Yet a schema optimized for OLTP is often not flexible enough for decision
support applications, largely due to the volume of data involved and the complexity of normalized relational
tables.

For example, each regional sales manager in a company may wish to produce a monthly summary of the sales
per region. Because the reporting server contains data at the same level of detail as the OLTP system, the
entire month's data is summarized each time the report is generated. The result is longer-running queries that
lower user satisfaction.

Additionally, many organizations store data in multiple heterogeneous database systems. Reporting is more
difficult because data is not only stored in different places, but in different formats.

Data warehousing and online analytical processing (OLAP) provide solutions to these problems. Data
warehousing is an approach to storing data in which heterogeneous data sources (typically from multiple OLTP
databases) are migrated to a separate homogenous data store. Data warehouses provide these benefits to
analytical users:

Data is organized to facilitate analytical queries rather than transaction processing.

Differences among data structures across multiple heterogeneous databases can be resolved.

Data transformation rules can be applied to validate and consolidate data when data is moved from the
OLTP database into the data warehouse.

Security and performance issues can be resolved without requiring changes in the production systems.

Sometimes organizations maintain smaller, more topic-oriented data stores called data marts. In contrast to a
data warehouse, which typically encapsulates all of an enterprise's analytical data, a data mart is typically a
subset of the enterprise data targeted at a smaller set of users or business functions.

…microsoft.com/…/cc917548(printer).aspx 1/20

Whereas a data warehouse or data mart are the data stores for analytical data, OLAP is the technology that
enables client applications to efficiently access the data. OLAP provides these benefits to analytical users:

Pre-aggregation of frequently queried data, enabling a very fast response time to ad hoc queries.

An intuitive multidimensional data model that makes it easy to select, navigate, and explore the data.

A powerful tool for creating new views of data based upon a rich array of ad hoc calculation functions.

Technology to manage security, client/server query management and data caching, and facilities to
optimize system performance based upon user needs.

The terms data warehousing and OLAP are sometimes used interchangeably. However, it is important to
understand their differences because each represents a unique set of technologies, administrative issues, and
user implications.

SQL Server Tools for Data Warehousing and OLAP

Microsoft SQL Server provides several tools for building data warehouses, data marts, and OLAP systems.
Using DTS Designer, you can define the steps, workflow, and transformations necessary to build a data
warehouse from a variety of data sources. After the data warehouse is built, you can use Microsoft SQL Server
OLAP Services, which provides a robust OLAP server that can be used to analyze data stored in many different
formats, including SQL Server and Oracle databases.

See Also

In This Volume

Data Transformation Services Data Warehousing Support

Installing OLAP Services

Server Improvements and New Features

OLAP Services Data Warehousing Support

In Other Volumes

"Overview of Data Transformation Services" in Microsoft SQL Server Distributed Data Operations and
Replication


Data Warehouse Characteristics

A data warehouse can assist decision support and online analytical processing (OLAP) applications because it
provides data that is:

Consolidated and consistent.

Subject-oriented.

Historical.

Read-only.

Consolidated and Consistent Data

A data warehouse consolidates operational data from a variety of sources with consistent naming conventions,
measurements, physical attributes, and semantics.

For example, in many organizations, applications can often use similar data in different formats: dates can be
stored in Julian or Gregorian format; true/false data can be represented as one/zero, on/off, true/false, or
positive/negative. Different applications can also use different terms to describe the same type of data. One
application can use the term "balance" instead of "total amount" to represent the amount of money in a bank
account.

Data should be stored in the data warehouse in a single, acceptable format agreed to by business analysts,
despite variations in the external operational sources. This allows data from across the organization, such as
legacy data on mainframes, data in spreadsheets, or even data from the Internet, to be consolidated in the
data warehouse, and effectively cross-referenced, giving the analysts a better understanding of the business.
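As a sketch of this kind of consolidation (the source record layouts, field names, and format choices here are hypothetical), a small transformation step might normalize boolean and date representations before the data is loaded:

```python
from datetime import datetime

# Hypothetical raw records from two operational systems that encode
# the same facts differently (flag styles, date formats, field names).
TRUE_VALUES = {"1", "on", "true", "y", "positive"}
FALSE_VALUES = {"0", "off", "false", "n", "negative"}

def normalize_flag(value):
    """Map the many operational true/false encodings to a single bool."""
    v = str(value).strip().lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognized flag: {value!r}")

def normalize_date(value):
    """Accept either MM/DD/YYYY or YYYY-MM-DD and emit ISO 8601."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def consolidate(record):
    """Rewrite one source record into the warehouse's agreed format."""
    return {
        "account": record["account"],
        "active": normalize_flag(record["active"]),
        "opened": normalize_date(record["opened"]),
        # "balance" and "total amount" name the same measure in
        # different source applications; the warehouse picks one name.
        "balance": record.get("balance", record.get("total amount")),
    }

rows = [
    {"account": "A1", "active": "on", "opened": "03/06/2011", "balance": 120.0},
    {"account": "A2", "active": "0", "opened": "2011-03-06", "total amount": 75.5},
]
print([consolidate(r) for r in rows])
```

In practice the agreed target formats come from the business analysts, as described above; the transformation code merely enforces them.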

Subject-oriented Data

Operational data sources across an organization tend to hold a large amount of data about a variety of
business-related functions, such as customer records, product information, and so on. However, most of this
information is also interspersed with data that has no relevance to business or executive reporting, and is
organized in a way that makes querying the data awkward. The data warehouse organizes only the key
business information from operational sources so that it is available for business analysis.

Historical Data

Data in OLTP systems correctly represents the current value at any moment in time. For example, an order-
entry application always shows the current value of stock inventory; it does not show the inventory at some
time in the past. Querying the stock inventory a moment later may return a different result. However, data
stored in a data warehouse is accurate as of some past point in time because it represents historical
information.

The data stored in a data warehouse typically represents data over a long period of time, perhaps ten
years or more. OLTP systems often contain only current data, because maintaining large volumes of data used
to represent ten years of information in an OLTP system can affect performance. In effect, the data
warehouse stores snapshots of a business's operational data generated over a long period of time. It is
accurate for a specific moment in time and cannot change. This contrasts with an OLTP system where data is
always accurate and can be updated when necessary.

Read-only Data

After data has been moved to the data warehouse successfully, it typically does not change unless the data
was incorrect in the first place. Because the data stored in a data warehouse represents a point in time, it
must never be updated. Deletes, inserts, and updates (other than those involved in the data loading process)
are not applicable in a data warehouse. The only operations that occur in a data warehouse, when it has been
set up, are loading and querying data.


Data Granularity

A significant difference between an OLTP or operational system and a data warehouse is the granularity of the
data stored. An operational system typically stores data at the lowest level of granularity: the maximum level
of detail. However, because the data warehouse contains data representing a long period in time, simply
storing all detail data from an operational system can result in an overworked system that takes too long to
query.

A data warehouse typically stores data in different levels of granularity or summarization, depending on the
data requirements of the business. If an enterprise needs data to assist strategic planning, then only highly
summarized data is required. The lower the level of granularity required by the enterprise, the greater the
resources (particularly data storage) needed to build the data warehouse. The different levels of
summarization, in order of increasing granularity, are:

Current operational data

Historical operational data

Aggregated data

Metadata

Current and historical operational data are taken, unmodified, directly from operational systems. Historical data
is operational level data no longer queried on a regular basis, and is often archived onto secondary storage.

Aggregated, or summary, data is a filtered version of the current operational data. The design of the data
warehouse affects how the current data is aggregated. Considerations for generating summary data include the
period of time used to aggregate the data (for example, weekly, monthly, and so on), and the parts of the
operational data to be summarized. For example, an organization can choose to aggregate at the part level the
quantity of parts sold per sales representative per week.

There may be several levels of summary data. It may be necessary to create summary level data based on an
aggregated version of existing summary data. This can give an organization an even higher level view of the
business. For example, an organization can choose to aggregate summary level data further by generating the
quantity of parts sold per month.
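The two summary levels described above can be sketched as follows (the table and column names are illustrative, and SQLite stands in for the warehouse database; months rolling into quarters gives a clean hierarchy for the example). The second summary is built from the first summary, not from the detail data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Detail-level operational data: quantity of parts sold per sale.
conn.execute("CREATE TABLE sales_detail (part TEXT, sale_month INTEGER, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [("bolt", 1, 10), ("bolt", 1, 5), ("bolt", 2, 7), ("bolt", 4, 3)],
)

# First summary level: quantity sold per part per month.
conn.execute("""
    CREATE TABLE sales_monthly AS
    SELECT part, sale_month, SUM(qty) AS qty
    FROM sales_detail
    GROUP BY part, sale_month
""")

# Second summary level: aggregate the existing summary, not the detail.
conn.execute("""
    CREATE TABLE sales_quarterly AS
    SELECT part, (sale_month - 1) / 3 + 1 AS quarter, SUM(qty) AS qty
    FROM sales_monthly
    GROUP BY part, quarter
""")

print(conn.execute("SELECT * FROM sales_quarterly ORDER BY quarter").fetchall())
# Q1 covers months 1-3 (10 + 5 + 7 bolts), Q2 covers month 4 (3 bolts)
```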

Metadata does not contain any operational data, but is used to document the way the data warehouse is
constructed. Metadata can describe the structure of the data warehouse, source of the data, rules used to
summarize the data at each level, and any transformations of the data from the operational systems.


Data Marts

A data mart is typically defined as a subset of the contents of a data warehouse, stored within its own
database. A data mart tends to contain data focused at the department level, or on a specific business area.
The data can exist at both the detail and summary levels. The data mart can be populated with data taken
directly from operational sources, similar to a data warehouse, or data taken from the data warehouse itself.
Because the volume of data in a data mart is less than that in a data warehouse, query processing is often
faster.

Characteristics of a data mart include:

Quicker and simpler implementation.

Lower implementation cost.

Focus on the needs of a specific business unit or function.

Protection of sensitive information stored elsewhere in the data warehouse.

Faster response times due to lower volumes of data.

Distribution of data marts to user organizations.

Built from the bottom upward.

Departmental or regional divisions often determine whether data marts or data warehouses are used. For
example, if managers in different sales regions require data from only their region, then it can be beneficial to
build data marts containing specific regional data. If regional managers require access to all the organization's
data, then a larger data warehouse is usually necessary.

Although data marts are often designed to contain data relating to a specific business function, there can be
times when users need a broader level of business data. However, because this broader-level data is often only
needed in summarized form, it is acceptable to store it within each data mart rather than implementing a full
data warehouse.

Building a Data Warehouse from Data Marts

Data warehouses can be built using a top-down or bottom-up approach. Top-down describes the process of
building a data warehouse for the entire organization, containing data from multiple, heterogeneous, operational
sources. The bottom-up approach describes the process of building data marts for departments, or specific
business areas, and then joining them to provide the data for the entire organization. Building a data
warehouse from the bottom-up, by implementing data marts, is often simpler because it is less ambitious.

A common approach to using data marts and data warehouses involves storing all detail data within the data
warehouse, and summarized versions within data marts. Each data mart contains summarized data per
functional split within the business, such as sales region or product group, further reducing the data volume per
data mart.

Data Mart Considerations

Data marts can be useful additions or alternatives to the data warehouse, but issues to consider before
implementation include:

Additional hardware and software.

Time required to populate each data mart regularly.

Consistency with other data marts and the data warehouse.

Network access (if each data mart is located in a different geographical region).


Designing and Building a Data Warehouse and OLAP System

The steps required to build a data warehouse include:

Determining business, user, and technical requirements.

Designing and building the database.

Extracting and loading data into the data warehouse.

Designing and processing aggregations using OLAP tools.

Querying and maintaining the data warehouse and OLAP databases.

Determining Business, User, and Technical Requirements

Before a data warehouse can be built, a detailed project and implementation plan should be written. The
project and implementation plan includes:

Building a business case.

Gathering user requirements.

Determining the technical requirements.

Defining standard reports required by users.

Analyzing client application tools being used.

Building the business case is common at the beginning of any project. It involves determining the business
needs solved by the project, the costs of the project, and the return on the investment.

Gathering user requirements largely involves interviewing the intended users of the data warehouse. The user
requirements determine:

Data requirements (level of granularity).

Operational systems within the enterprise containing the data.

Business rules followed by the data.

Queries required to provide the users with data.

The technical requirements may involve determining:

Hardware architecture and infrastructure (for example, links to remote geographical regions where data
marts might be located).

Backup and recovery mechanisms.

Security guidelines.

Methods of loading and transforming data from operational systems to the data warehouse.

Standard reports required by users should be analyzed to determine the tables, columns, and selection criteria
necessary to create the reports, and the frequency with which they are generated. Provisions should also be
made for expanding or modifying the scope of reports as required.

Client application tools should be analyzed to determine if they can provide enhanced processing capabilities
that help in processing data, performing queries, or generating reports.

Designing and Building the Database

Designing and building the database is a critical part of building a successful data warehouse. This step is often
performed by experienced database designers because it can involve taking data from multiple (sometimes
heterogeneous) sources and combining it into a single, logical model.

Unlike OLTP systems, which store data in a highly normalized fashion, a data warehouse stores its data in a
highly denormalized manner to improve query performance. Data warehouses often use star and snowflake
schemas to provide the fastest possible response times to complex queries and to serve as the basis for
aggregations managed by OLAP tools.
The components of schema design are dimensions, keys, and fact and dimension tables.

Fact tables

Contain data that describes a specific event within a business, such as a bank transaction or product sale.
Alternatively, fact tables can contain data aggregations, such as sales per month per region. Except in cases
such as product or territory realignments, existing data within a fact table is not updated; new data is simply
added.

Because fact tables contain the vast majority of the data stored in a data warehouse, it is important that the
table structure be correct before data is loaded. Expensive table restructuring can be necessary if data
required by decision support queries is missing or incorrect.

The characteristics of fact tables are:

Many rows; possibly billions.

Primarily numeric data; rarely character data.

Multiple foreign keys (into dimension tables).

Static data.

Dimension tables

Contain data used to reference the data stored in the fact table, such as product descriptions, customer
names and addresses, and suppliers. Separating this verbose (typically character) information from specific
events, such as the value of a sale at one point in time, makes it possible to optimize queries against the
database by reducing the amount of data to be scanned in the fact table.

Dimension tables do not contain as many rows as fact tables, and dimensional data is subject to change, as
when a customer's address or telephone number changes. Dimension tables are structured to permit change.

The characteristics of dimension tables are:

Fewer rows than fact tables; possibly hundreds to thousands.

Primarily character data.

Multiple columns that are used to manage dimension hierarchies.

One primary key (dimensional key).

Updatable data.

Dimensions

Are categories of information that organize the warehouse data, such as time, geography, organization, and so
on. Dimensions are usually hierarchical in that one member may be a child of another member. For example, a
geography dimension may contain data by country/region, state, and city. A city member is a child of a state
member, which is in turn a child of a country member. Thus, the dimension comprises three hierarchical
levels: all countries, all states, and all cities in the dimension table. To support this, the dimension table should
include the relationship of each member to the higher levels in the hierarchy.
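A geography dimension of this shape might look like the following sketch (SQLite and the column names are illustrative): each row carries its full path up the hierarchy, so a query can roll members up to any level with a simple GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each city row records its parent members in the hierarchy explicitly.
conn.execute("CREATE TABLE geography_dim (country TEXT, state TEXT, city TEXT)")
conn.executemany("INSERT INTO geography_dim VALUES (?, ?, ?)", [
    ("USA", "Washington", "Seattle"),
    ("USA", "Washington", "Spokane"),
    ("USA", "Oregon", "Portland"),
])

# Roll city members up to the state level...
states = conn.execute(
    "SELECT country, state, COUNT(*) FROM geography_dim "
    "GROUP BY country, state ORDER BY state"
).fetchall()
# ...or all the way up to the country level.
countries = conn.execute(
    "SELECT country, COUNT(*) FROM geography_dim GROUP BY country"
).fetchall()
print(states)     # city counts per state
print(countries)  # city counts per country
```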

Dimensional keys

Are unique identifiers used to query data stored in the central fact table. The dimensional key, like a primary
key, links a row in the fact table with one dimension table. This structure makes it easy to construct complex
queries and support drill-down analysis in decision support applications. An optimal data warehouse database
contains long, narrow fact tables and small, wide dimension tables.

Star Schema

The most popular design technique used to implement a data warehouse is the star schema. The star schema
structure takes advantage of typical decision support queries by using one central fact table for the subject
area, and many dimension tables containing denormalized descriptions of the facts. After the fact table is
created, OLAP tools can be used to preaggregate commonly accessed information.

The star schema design helps to increase query performance by reducing the volume of data that is read from
disk. Queries analyze data in the smaller dimension tables to obtain the dimension keys that index into the
central fact table, reducing the number of rows to be scanned.

Snowflake Schema

The snowflake schema is a variation of the star schema where dimension tables are stored in a more normalized
form. This can improve performance for some queries by reducing the amount of data read from disk, although
queries may then require additional joins between the normalized dimension tables.

Creating a Database Schema

The database schema should support the business requirements rather than the typical query-driven
requirements of an OLTP database design. Consider, for example, a database schema from an order entry
system consisting of Customers, Orders, Products, and Order_Details tables.

The steps involved in converting this OLTP schema into a star schema include:

Determining the fact and dimension tables.

Designing the fact tables.

Designing the dimension tables.

Determining Fact and Dimension Tables

It is important to determine correctly what existing tables and data in the operational systems should comprise
the fact and dimension tables. If these are not correctly identified, then the data warehouse can suffer from
poor performance, or may have to be redesigned at a later stage. Redesigning an operational data warehouse,
possibly containing large amounts of data, can be a prohibitively expensive task.

Although the process of determining the structure and composition of fact and dimension tables can be
difficult, especially when there are multiple (heterogeneous) operational systems to consider, the two most
important steps to follow are identifying the:

Fundamental business transactions on which the data warehouse will focus (fact tables).

Data associated with the business transactions that determine how business data will be analyzed
(dimension tables and hierarchies).

Identify Fundamental Business Transactions

The first step involves identifying the transactions that describe the basic operations of the business which the
data warehouse will be used to analyze. For example, using the sample order entry system described earlier,
the underlying business transaction is a sale of a product. Identifying the fundamental business transactions
yields the information that will be represented in the fact tables. The information needed to describe the sale
of a product is largely found in the Order_Details table.

When analyzing existing operational systems for potential fact tables, always look for the underlying business
processes involved. Many operational systems are designed based on necessity rather than an accurate
business model. For example, a school database may record only the grade per student for all subjects for a
year because it does not have enough disk space to store the data at a lower level of detail.

In this instance, a data warehouse used to store student data from all schools in a region should be designed
to capture this summarized data as well as a lower level of detail when the schools are able to provide the
information in the future. For example, the fact table might store details regarding the grades for each subject,
per student, per school, per region, per date period.

Identify the Dimension Tables

The next step involves identifying the entities that describe how the fact data will be analyzed. For example,
given that the order entry system fundamental transaction is the sale of a product, dimension data from the
operational schema could include payment method, product name, date of sale, or shipping method. However,
the dimension data chosen should represent the focus of the business analysis. As an example, the business
analysis performed on the order entry data warehouse will include variations of:

Sales of a specific product per region.

Sales of a specific product per time period (for example, a quarter).

All sales per region.

All sales per time period.

Therefore, the dimension tables will include product data, region data, and time period data. In this example,
payment or shipping methods were not required because the business will not use the data warehouse to
analyze that data.

From the original order entry OLTP schema, all the fact and dimension data for the data warehouse can be
found in the Customers, Orders, Products, and Order_Details tables.

Designing Fact Tables

The primary goal when designing fact tables is to minimize their size without compromising the data
requirements. Fact tables are the largest tables in the database because they contain detail-level data
representing the underlying business transactions. However, the costs of storing and maintaining these large
tables should be considered. For example, larger tables require more online, and potentially offline, storage;
take longer to back up and restore in the event of a system failure; and take longer to query when building
OLAP aggregations.

The easiest ways to reduce the size of fact tables include:

Reducing the number of columns.

Reducing the size of each column where possible.

Archiving historical data into separate fact tables.

Reducing the Number of Columns

Remove any columns that are not required to analyze the operations of the business. For example, if the data
does not represent a business transaction, or if the data can be derived using aggregates, remove the data
from the fact table. Although aggregated columns often improve query performance, the size of a typical fact
table can prohibit using them. For example, if the Order_Details fact table contains one billion rows, and a
column, Total_Price, is added representing Quantity multiplied by UnitPrice, one billion new values now exist
permanently in the table.

Important If a column is moved from a fact table to another table, and is referenced frequently in queries
involving data from the fact table, large join operations may be required. These joins can affect query
performance. Therefore, the trade-off between reducing storage costs and affecting query performance should
be determined.

Although Order_Details forms the basis of the fact table, the OrderID column is not required in the final fact
table because OrderDetailID is the unique identifier for the business transaction: a sale of a product. In fact,
OrderID does not represent a single business transaction; it represents the sale of one or many products to a
single customer, and so cannot be used.

Reducing the Size of Each Column

Because fact tables tend to have a large number of rows, even one redundant byte per row can add up to a
large amount of wasted database space. For example, a fact table containing one billion rows, with one unused
byte in one of the columns, represents almost 1 GB of unused database space. To reduce column widths:

Ensure that all character and binary data is variable length.

Use data types that require fewer bytes of storage where possible. For example, if a column contains
integer values only in the range from 1 through 100, use tinyint rather than int, saving 3 bytes per row
for that column.
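The arithmetic behind these savings is easy to check; a quick sketch for the hypothetical one-billion-row fact table used in the examples above:

```python
ROWS = 1_000_000_000  # hypothetical fact table size

# One unused byte per row, as in the example above.
wasted = ROWS * 1            # bytes
print(wasted / 2**30)        # ~0.93 GB of dead space

# Narrowing a column from int (4 bytes) to tinyint (1 byte)
# saves 3 bytes on every row.
saved = ROWS * (4 - 1)
print(saved / 2**30)         # ~2.79 GB reclaimed
```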

Archiving Historical Data

If data within fact tables is rarely used, such as sales data from several years ago, it may be useful to archive
the data. This approach reduces the volume of data in the fact table, thereby increasing the performance of
queries. Occasional queries on older data can be run against multiple fact tables without affecting the
majority of users querying the fact tables containing recent data. When Microsoft SQL Server OLAP Services is
used in conjunction with multiple fact tables, the OLAP Services engine manages queries against multiple
back-end fact tables, simplifying the management and use of multiple fact tables containing historical data.

Designing Dimension Tables

The primary goal in designing dimension tables is to denormalize the data that references the fact tables into
single tables. The most commonly used dimension data should reference the fact tables directly, rather than
indirectly through other tables. This approach minimizes the number of table joins, and speeds up performance.
For example, the order entry star schema should support the business queries:

Sales of a specific product per region

All sales per region

Currently, the dimension data describing a region (City, StateOrProvince, and Country) is part of the
Customers table. However, Customers references Order_Details (fact data) using Orders.

To better support the business analysis required, the region data should be placed into a new table, Region,
directly referencing Order_Details. To implement this, a foreign key from the Region dimension table is added
to Order_Details (now renamed to Sales). Any queries involving sales per region now require only a two-table
join between the Region dimension table, and the Sales fact table.

Note The existing relationship between the Sales fact table and the Products dimension data is unchanged.

Date and Time Information

Date information is a common requirement in a data warehouse. To minimize the fact table width, a foreign key
is often created in the fact table referencing a dimension table containing a representation of the date and/or
time. The representation of the date depends on business analysis requirements.

For example, the business analysis to be performed on the order entry system requires product sales
summarized by month, quarter, and year. The date information should be stored in a form that represents these
increments. This is achieved by creating a foreign key in the fact table referencing a date dimension table
(named Period) containing the date of the sale in a month, quarter, year format. To increase the flexibility of
this data, additional dimension tables are created, referenced by the Period dimension table, that contain
months, quarters, and years in more detail. When designing dimension tables for use with Microsoft SQL Server
OLAP Services, only a date is needed. The OLAP Services Time Wizard enables dates to be summarized into any
combination of weeks, months, quarters, and years.
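Deriving the month, quarter, and year levels from a raw sale date is straightforward; a sketch of building one row of such a Period dimension (the column names are illustrative):

```python
from datetime import date

def period_row(sale_date: date):
    """Break a sale date into the month/quarter/year levels
    that the Period dimension table stores."""
    return {
        "date": sale_date.isoformat(),
        "month": sale_date.month,
        "quarter": (sale_date.month - 1) // 3 + 1,  # months 1-3 -> Q1, etc.
        "year": sale_date.year,
    }

print(period_row(date(1998, 11, 15)))
# {'date': '1998-11-15', 'month': 11, 'quarter': 4, 'year': 1998}
```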

Implementing the Database Design

After the fact and dimension tables have been designed, the final step is to physically implement the database
in Microsoft SQL Server.

Creating the Database

When creating the database, consider the partitioning strategy, if any, that may be used. SQL Server offers
filegroups, which can be used to stripe data in addition to the disk striping available with Microsoft Windows NT
and hardware-based implementations.

Creating the Tables

When creating the tables used to store the fact and dimension data, consider creating the tables across the
partitions available to the database, based on usage. For example, create separate fact tables containing data
segmented by year or division on separate partitions (such as SQL Server filegroups) to improve read
performance.

Creating Any User-defined Views

Create user-defined views if necessary. SQL Server views can be used to merge horizontally partitioned tables
together logically, as interfaces to predefined queries, or as a security mechanism.

Creating Indexes

Indexes should be created to maximize performance. Consider creating indexes on:

Key columns.

Columns involved in joins.

Multiple columns, to take advantage of index coverage.

All dimension table keys used by the fact table.
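A minimal sketch of this indexing guidance, using SQLite syntax in place of Transact-SQL; the index and column names are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Sales (ProductID INTEGER, RegionID INTEGER,
                        PeriodID INTEGER, Quantity INTEGER);
    -- One index per dimension key used by the fact table,
    -- plus a multi-column index to take advantage of index coverage.
    CREATE INDEX ix_sales_product ON Sales (ProductID);
    CREATE INDEX ix_sales_region  ON Sales (RegionID);
    CREATE INDEX ix_sales_period  ON Sales (PeriodID);
    CREATE INDEX ix_sales_cover   ON Sales (PeriodID, RegionID, Quantity);
""")
names = [r[1] for r in con.execute("PRAGMA index_list('Sales')")]
print(sorted(names))
```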

See Also

In Other Volumes

"CREATE VIEW" in Microsoft SQL Server Transact-SQL and Utilities Reference

"Overview of Creating and Maintaining Databases" in Microsoft SQL Server Database Developer's Companion

"Indexes" in Microsoft SQL Server Database Developer's Companion

"Physical Database Design" in Microsoft SQL Server Diagnostics

Extracting and Loading Data

Extracting and loading data from operational systems to a data warehouse varies in complexity. The process
can be simple if there is a direct correlation between the source data and the data that should appear in the
data warehouse: for example, if all the source data from a single operational system is in the correct format,
and does not have to be modified in any way. The process can also be complex: for example, if source data
resides in multiple, heterogeneous operational systems, and requires significant formatting and modification
before loading.

The extraction and load process involves:

Validating data in the operational systems.

Migrating data from the operational systems.

Scrubbing data.

Transforming data to the data warehouse.

Validating Data

Before data is extracted from the operational systems, it may be necessary to ensure that the data is
completely valid. If the data is not valid, the integrity of the business analysis relying on the data may be
compromised. For example, a value representing a monetary transfer between banks in different countries must
be in the correct currency.

Data should be validated at the source by business analysts who understand what the data represents. Any
changes should be made in the operational systems, rather than the data warehouse, because the source data
is incorrect regardless of where it is located.

Validating data can be a time-consuming process. The validation process can be automated by writing stored
procedures that check the data for domain integrity. However, it may be necessary to validate data manually.
If any invalid data is discovered, determine where the fault originated and correct any processes contributing
to the error.

For example, the data in the order entry system should be validated to ensure that:

Region information (City, State, Country) represents a valid city, state, country/region combination.

Product information (ProductID, ProductName, UnitPrice) represents valid products.

This information can be validated using the Data Transformation Services import and export wizards. A
Microsoft ActiveX script, executed by the DTS Import and DTS Export wizards when copying data from the
source to the destination, can determine if the region and product information is valid. Any invalid data can be
saved to the exception log for later examination by business analysts, to determine why it is incorrect.
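The validate-and-log pattern can be sketched in ordinary code; the reference sets and column names below are assumptions, standing in for the ActiveX script and the DTS exception log:

```python
# Assumed reference data; in practice this would come from lookup tables.
VALID_REGIONS = {("Seattle", "WA", "USA"), ("Tacoma", "WA", "USA")}
VALID_PRODUCT_IDS = {10, 11}

def validate_rows(rows):
    """Split rows into valid rows and an exception log, mirroring the
    check performed during the copy from source to destination."""
    valid, exceptions = [], []
    for row in rows:
        region = (row["City"], row["State"], row["Country"])
        if region in VALID_REGIONS and row["ProductID"] in VALID_PRODUCT_IDS:
            valid.append(row)
        else:
            exceptions.append(row)  # saved for later examination
    return valid, exceptions

rows = [
    {"City": "Seattle", "State": "WA", "Country": "USA", "ProductID": 10},
    {"City": "Seattle", "State": "OR", "Country": "USA", "ProductID": 10},
]
good, bad = validate_rows(rows)
print(len(good), len(bad))  # 1 1
```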

See Also

In Other Volumes

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations
and Replication

"Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

Migrating Data

Migrating data from operational systems typically involves copying the data to an intermediate database before
it is finally copied to the data warehouse. Copying data to an intermediate database is necessary if the data
must be scrubbed.

Ideally, copying data should occur during a period of low activity on the operational system. Otherwise, system
performance may degrade, affecting active users. Additionally, if the data warehouse is composed of data from
multiple interrelated operational systems, then it is important to ensure that data migration occurs when the
systems are synchronized. If the operational systems are not synchronized, the data in the warehouse can
produce unexpected results when queried.

The DTS Import and DTS Export wizards can be used to create a series of tasks that copy data from many
heterogeneous operational systems to an intermediate database running on Microsoft SQL Server. Alternatively,
you can use a Microsoft ActiveX script with DTS to scrub the data, and then copy it directly to the SQL Server
data warehouse, avoiding the need for an intermediate database.
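Copying source rows into an intermediate (staging) database can be sketched as follows, with SQLite standing in for both the operational system and SQL Server; table and column names are assumptions:

```python
import sqlite3

source = sqlite3.connect(":memory:")   # stands in for the operational system
staging = sqlite3.connect(":memory:")  # stands in for the intermediate database
source.executescript("""
    CREATE TABLE Orders (OrderID INTEGER, OrderDate TEXT);
    INSERT INTO Orders VALUES (1, '1998-08-14'), (2, '1998-09-02');
""")
staging.execute("CREATE TABLE Orders (OrderID INTEGER, OrderDate TEXT)")

# Copy rows from the operational source into staging, where they
# can be scrubbed before the final load into the warehouse.
rows = source.execute("SELECT OrderID, OrderDate FROM Orders").fetchall()
staging.executemany("INSERT INTO Orders VALUES (?, ?)", rows)

count = staging.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(count)  # 2
```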

See Also

In Other Volumes

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations
and Replication

Scrubbing Data

Scrubbing data involves making the data consistent. It is possible that the same data is represented in
different ways in multiple operational systems. For example, a product name might be abbreviated in one
operational system, but not in another. If the two values were not made consistent, any queries using the data
likely would evaluate the values as different products. If the detail data in the data warehouse is to produce
consistent information, the product name must be made consistent for all values.

Data scrubbing can be achieved by:

Using the DTS Import and DTS Export wizards to modify data as it is copied from the operational systems
to the intermediate database, or directly to the data warehouse.

Writing a Microsoft ActiveX script, executed by a program using the DTS API, to connect to the data
source and scrub the data. Any data manipulation that can be achieved using ActiveX scripting, or a
programming language such as Microsoft Visual C++, can be performed on the data.

Using a DTS Lookup, which provides the ability to perform queries using one or more named,
parameterized query strings, allowing a custom transformation to retrieve data from locations other than
the immediate source or destination row being transformed.

For example, the following values in the order entry system data should be scrubbed:

State (must always be a two-character value, such as WA).

ProductName (must always be the full product name, with no abbreviations).

Using the DTS Import and DTS Export wizards, an ActiveX script can be executed during the copy process
that checks the State value and changes it to a known two-character value. Alternatively, the
ProductName value could be scrubbed by writing a Visual C++ program that calls the DTS API to execute
Microsoft JScript scripts and other executable modules.
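The two scrubbing rules above can be sketched as follows; the lookup tables are assumptions, standing in for whatever reference data the script would consult:

```python
# Assumed reference data for the two scrubbing rules.
STATE_CODES = {"Washington": "WA", "Oregon": "OR"}
PRODUCT_NAMES = {"Chai Tea Bev.": "Chai Tea Beverage"}

def scrub(row):
    """Normalize State to a two-character code and expand any
    abbreviated product name, per the example above."""
    row = dict(row)
    row["State"] = STATE_CODES.get(row["State"], row["State"]).upper()[:2]
    row["ProductName"] = PRODUCT_NAMES.get(row["ProductName"],
                                           row["ProductName"])
    return row

clean = scrub({"State": "Washington", "ProductName": "Chai Tea Bev."})
print(clean)  # {'State': 'WA', 'ProductName': 'Chai Tea Beverage'}
```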

See Also

In Other Volumes

"Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations
and Replication

"DTS Lookup" in Microsoft SQL Server Distributed Data Operations and Replication

"Programming DTS Applications" in Microsoft SQL Server Building Applications

Transforming Data

During the data migration step, it is often necessary to transform operational data into a separate format
appropriate to the data warehouse design. Transformation examples include:

Changing all alphabetic characters to uppercase.

Calculating new values based on existing data, including data aggregation and summarization.

Breaking up a single data value into multiple values, such as a product code in nnnn-description format
into separate code and description values, or a date value in MMDDYY format into separate month, day,
and year values.

Merging separate data values into a single value, such as concatenating a first name value with a
surname value.

Mapping data from one representation to another, such as converting data values (1, 2, 3, 4) to (I, II,
III, IV).
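The transformations listed above can be sketched in one small function; all field names and sample values are assumptions for illustration:

```python
def transform(row):
    """Apply the example transformations: merge and uppercase names,
    split an nnnn-description code, split an MMDDYY date, remap values."""
    out = {}
    out["Name"] = (row["FirstName"] + " " + row["Surname"]).upper()
    code, description = row["ProductCode"].split("-", 1)
    out["Code"], out["Description"] = code, description
    mmddyy = row["OrderDate"]
    out["Month"], out["Day"], out["Year"] = mmddyy[:2], mmddyy[2:4], mmddyy[4:]
    out["Priority"] = {1: "I", 2: "II", 3: "III", 4: "IV"}[row["Priority"]]
    return out

row = {"FirstName": "Nancy", "Surname": "Davolio",
       "ProductCode": "0042-Chai", "OrderDate": "081498", "Priority": 3}
out = transform(row)
print(out)
```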

Data transformation also involves formatting and modifying extracted data from operational systems into
merged or derived values that are more useful in the data warehouse. For example, copying the OrderDate
value from the order entry system to the data warehouse star schema, involves splitting the date into Month,
Quarter, and Year components. These date components are required for the type of business analysis
performed on the data warehouse.

The transformation process usually takes place during the migration process, when data is copied either
directly from the operational sources or from an intermediate database once the data has been scrubbed.
For complex data migrations, DTS provides skip return values to assist in splitting data into multiple tables.

Data transformation and migration can be completed in a single step using the DTS Import and DTS Export
wizards. Transforming and migrating data from the order entry OLTP operational system schema to the data
warehouse star schema involves using the DTS Import and DTS Export wizards to:

Create a query to extract all the required detail level (fact) data.

Split OrderDate in the Orders table into Month, Quarter, and Year components and add to Period
using a Microsoft ActiveX script.

Extract the City, StateOrProvince, and Country data relating to the detail data and add to Region
using an ActiveX script.

Perform a simple table copy of Products.

Create a query to generate the data for Summary.

For example, each step can be built as a separate package, stored in the Microsoft SQL Server msdb
database, and scheduled to execute every Friday night at midnight.

In addition to performing insert-based transformations of data, DTS provides data-driven queries, in which data
is read from the source and transformed, and a parameterized query is executed at the destination using the
transformed values in the destination row.

Note When using DTS to create fact tables for use with Microsoft SQL Server OLAP Services, do not create
any aggregations while migrating the data. OLAP Services is specifically designed to create the optimal
aggregations after the data warehouse has been populated with DTS. It is also unnecessary to segment a date
into week, month, quarter, or year columns in the Time dimension table. The OLAP Services Time Wizard
provides an automated facility for this type of time transformation.

See Also

In Other Volumes

"Column Mappings" in Microsoft SQL Server Distributed Data Operations and Replication

"Data-Driven Queries" in Microsoft SQL Server Distributed Data Operations and Replication

"Data Transformation Services Import and Export Wizards" in Microsoft SQL Server Distributed Data Operations
and Replication

"Understanding Data Transformation Services" in Microsoft SQL Server Building Applications

Designing and Processing Aggregations

OLAP tools are typically used to create and manage summary data. Microsoft SQL Server OLAP Services allows
aggregations to be stored in a variety of formats and locations, with dynamic connections to underlying details
in the data warehouse. Summary data is often generated to satisfy the commonly executed queries in the data
warehouse. Storing preaggregated data increases query performance, and reduces the load on the data
warehouse.
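The idea of storing preaggregated summary data can be sketched with a simple GROUP BY into a Summary table; SQLite stands in for SQL Server, and the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Sales (PeriodID INTEGER, RegionID INTEGER, Amount REAL);
    INSERT INTO Sales VALUES (1, 1, 10.0), (1, 1, 20.0), (1, 2, 5.0);
    -- Precompute the commonly executed aggregate once, so repeated
    -- queries read the small Summary table instead of the fact table.
    CREATE TABLE Summary AS
        SELECT PeriodID, RegionID, SUM(Amount) AS Total
        FROM Sales GROUP BY PeriodID, RegionID;
""")
summary = con.execute("SELECT * FROM Summary ORDER BY RegionID").fetchall()
print(summary)  # [(1, 1, 30.0), (1, 2, 5.0)]
```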

If a data warehouse is built so the data in it does not change, then preaggregating data in the fact table saves
only the disk space required by the fact table. OLAP Services uses the processing time that would have been
used to preaggregate in the fact table when it processes the fact table as it builds a cube. However,
precalculated aggregations are stored in the cube and do not need to be recalculated for each query. If a
hybrid OLAP (HOLAP) or relational OLAP (ROLAP) cube is used, the fact table is not copied into the cube as it is
in multidimensional OLAP (MOLAP) cubes, so the overhead required to retain availability of the detail data is
only the fact table size, not processing time or query response time.

The preaggregation strategy when designing a data warehouse for use by OLAP Services depends on the following
variables:

Stability of the data.

If the source data changes, the preaggregations have to be performed each time, whether
preaggregated in the fact table or in the OLAP cubes that have to be rebuilt from the fact table.

Query response time.

With properly designed OLAP cubes, the granularity of detail in the fact table has no effect on query
response time for queries that do not access detail facts.

Storage requirements.

A finer level of granularity in the fact table requires more storage for the fact table and for MOLAP cubes.
This is a trade-off against detail availability and choice of OLAP cube storage mode. OLAP cubes tend to
be large regardless of the storage type; therefore the storage required to retain fine granularity in the
fact table may not be particularly significant when compared to OLAP storage needs.

When designing the data warehouse for OLAP, the user's needs should drive the preaggregation strategy. The
fact table should only be preaggregated to the level of granularity below which no user would want to access
detail.

For more information, see your OLAP Services documentation.

Maintaining the Data Warehouse

Maintenance of the data warehouse is an ongoing task that should be designed before the data warehouse is
made available to users. Maintenance involves:

Implementing a backup and recovery mechanism to protect the data in the event of a system failure, or
some other problem.

Archiving the database. This may be necessary to purge the database of unused historical data, and free
up space.

Running SQL Server Profiler to determine which indexes to create to enhance query performance.

See Also

In Other Volumes

"Monitoring with SQL Server Profiler" in Microsoft SQL Server Administrator's Companion


Data Transformation Services Data Warehousing Support

Using Data Transformation Services (DTS), you can import and export data between multiple heterogeneous
sources using an OLE DB-based architecture, and transfer databases and database objects (for example,
indexes and stored procedures) between computers running Microsoft SQL Server version 7.0. You can also use
the data transformation capabilities of DTS to build a data warehouse from an online transaction processing
(OLTP) system. You can build data warehouses and data marts in SQL Server by importing and transferring
data from multiple heterogeneous sources interactively or automatically on a regularly scheduled basis.

DTS components include the DTS Import Wizard, DTS Export Wizard, and DTS Designer, which are available
through SQL Server Enterprise Manager. DTS also includes COM programming interfaces you can use to create
customized import, export, and transformation applications.

A transformation is the set of operations applied to source data before it is stored at the destination during the
process of building a data warehouse. For example, the DTS transformation capability allows calculating new
values from one or more source columns, or even breaking a single column into multiple values to be stored in
separate destination columns. Transformations, therefore, make it easy to implement complex data validation,
scrubbing, and enhancement during import and export.

Data Transformation Services (DTS) allows you to import, export, or transform data in a process that can be
saved as a package. Each package defines a workflow that includes one or more tasks executed in a
coordinated sequence as steps. Tasks can copy data from a source to a destination, transform data using a
Microsoft ActiveX script, execute an SQL statement at a server, or even run an external program. Tasks can
also transfer database objects between computers running SQL Server 7.0.

A DTS package can be created programmatically by using a language that supports OLE Automation, such as
Microsoft Visual Basic, or interactively by using the Data Transformation Services wizards or DTS Designer.
After a DTS package has been created and saved, it is completely self-contained and can be retrieved and run
using SQL Server Enterprise Manager or the dtsrun utility.

DTS packages can be stored in the Microsoft Repository, providing the ability to record data lineage. This
allows you to determine the source of any piece of data, and the transformations applied to that data. Data
lineage can be tracked at the package and row levels of a table and provide a complete audit trail of data
transformation and DTS package execution information in your data warehouse.

DTS Designer is a graphical design environment for creating and executing complex sets of data transformations
and workflows, in preparation for moving data to a data warehouse. Experienced users can use DTS Designer to
integrate, consolidate, and transform heterogeneous data from multiple sources. Packages created can be
stored in a SQL Server msdb database, the Repository, or a COM-structured storage file.

The visual objects used by DTS Designer are based on the DTS object model, an API that includes objects,
properties, methods, and collections designed for programs that copy and transform data from an OLE DB data
source to an OLE DB destination. This object model can be accessed through ActiveX scripts from within DTS
Designer, and through external programs written in languages such as Visual Basic and Microsoft Visual C++.
You can also access custom programs through DTS Designer, and include their tasks and icons as part of the
package. Because DTS Designer accesses an underlying programming model, it does most of the programming
work for you.

See Also

In Other Volumes

"DTS Designer" in Microsoft SQL Server Distributed Data Operations and Replication

"Overview of Data Transformation Services" in Microsoft SQL Server Distributed Data Operations and
Replication


OLAP Services Data Warehousing Support

Microsoft SQL Server OLAP Services provides online analytical processing (OLAP) services to applications.

OLAP focuses on finding trends in aggregated or summarized data. The main objects used by OLAP programs
are multidimensional cubes. A multidimensional cube records a set of data derived from fact tables and
dimensions. A fact table records data about a set of transactions. Measures are numeric columns from the fact
table that are of interest to the user. For example, a car sales fact table could provide measures such as sale
price, invoice cost, tax paid, and discount. A cube represents how these measures vary over multiple
dimensions, such as by car dealer, by location, by customer, or by date.
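How measures aggregate across dimension members can be sketched without any OLAP engine; the dealer and year dimensions and the sale-price measure below are illustrative values, not taken from the original:

```python
from collections import defaultdict

# Car-sale facts: Dealer and Year play the role of dimensions,
# SalePrice the role of a measure.
facts = [
    {"Dealer": "North", "Year": 1998, "SalePrice": 21000},
    {"Dealer": "North", "Year": 1998, "SalePrice": 18000},
    {"Dealer": "South", "Year": 1998, "SalePrice": 25000},
]

# Each cube cell holds the aggregated measure for one combination
# of dimension members.
cube = defaultdict(float)
for f in facts:
    cube[(f["Dealer"], f["Year"])] += f["SalePrice"]

print(cube[("North", 1998)])  # 39000.0
```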

OLAP Services provides the capability of designing, creating, and managing cubes from a data warehouse, and
then making them available to client applications written using either the OLE DB 2.0 OLAP extensions, or the
Microsoft ActiveX Data Objects 2.0 Multidimensional Objects (ADO MD).


The OLAP server performs multidimensional queries of data and stores the results in its multidimensional
storage. It speeds the analysis of fact tables into cubes, stores the cubes until needed, and then quickly
returns the data to clients.

The OLAP server is managed through an API called the Microsoft Decision Support Objects (DSO). OLAP
Services provides a snap-in for the Microsoft Management Console (MMC). This MMC snap-in uses DSO to
provide administrators with an easy-to-use graphical interface for defining, creating, and managing the cubes
built by the OLAP server. DSO can also be called from custom applications, which in turn can be added to the
OLAP Manager as an add-in.

OLAP Services passes its multidimensional data to a middle-tier PivotTable Service. The PivotTable Service
operates as an OLE DB for OLAP provider. It exposes the multidimensional data to applications using either the
OLE DB 2.0 OLAP extensions, or the ADO MD API that wraps the OLE DB OLAP extensions.

See Also

In This Volume

Installing OLAP Services

Server Improvements and New Features


© 2011 Microsoft. All rights reserved.
