Lecture-7
Operational Infrastructure
• Operational infrastructure to support each architectural component consists of
• People
• Procedures
• Training
• Management software
• These are not the people and procedures needed for developing the data
warehouse.
• These are the people and procedures needed to keep the data warehouse
going. They are as essential as the hardware and software that keep the data
warehouse running: they support the management of the data warehouse and
maintain its efficiency.
Infrastructure Classification (Cont.)
Physical Infrastructure
• The platform consists of the basic hardware components, the operating
system with its utility software, the network, and the network software. Along
with the overall platform is the set of tools that run on the selected platform to
perform the various functions and services of individual architectural
components.
HARDWARE AND OPERATING SYSTEM
• Hardware and operating systems make up the computing environment for
your data warehouse.
• All the data extraction, transformation, integration, and staging jobs run on the
selected hardware under the chosen operating system.
• When you transport the consolidated and integrated data from the staging
area to your data warehouse repository, you make use of the server hardware
and the operating system software.
• When the queries are initiated from the client workstations, the server
hardware, in conjunction with the database software, executes the queries and
produces the results.
Guidelines for Hardware Selection
Here are some general guidelines for hardware selection; they are not entirely
specific to data warehouse hardware.
• Support: Vendor support is crucial for hardware maintenance. Make sure that
the support from the hardware vendor is at the highest possible level.
• Vendor Stability: Check on the stability and staying power of the vendor.
Guidelines for OS Selection
Let us quickly consider a few general criteria for the selection of the operating
system. First of all, the operating system must be compatible with the hardware.
A list of criteria follows:
• Scalability: Along with the hardware and database software, the operating
system must be able to support the increase in the number of users and
applications.
• Security: The operating system must provide each client with a secure
environment.
Platform Options
• We will now discuss platform options in terms of the functions in the three
areas of data acquisition, data storage, and information delivery.
Platform Options
1. Single Platform Option
• This is the most straightforward and simplest option for implementing the data
warehouse architecture.
• In this option, all functions from the backend data extraction to the front-end
query processing are performed on a single computing platform.
• This was perhaps the earliest approach, when developers were implementing
data warehouses on existing mainframes, minicomputers, or a single UNIX-
based server.
• Because all operations in the data acquisition, data storage, and information
delivery areas take place on the same platform, this option hardly ever
encounters any compatibility or interface problems.
• The data flows smoothly from beginning to end without any platform-to-
platform conversions. No middleware is needed.
• All tools work in a single computing environment.
Platform Options – Hybrid Option
• If your company falls in the category where the legacy platform will
accommodate the data warehouse, then, by all means, take the approach of
a single-platform solution. Again, the single-platform solution, if feasible, is
the easier solution.
• For the rest of us who are not that fortunate, we have to consider other
options. Let us begin with data extraction, the first major operation, and follow
the flow of data until it is consolidated into load images and waiting in the
staging area.
• We will now step through the data flow and examine the platform options.
i. Data Extraction: Perform data extraction from each source system on its
own computing platform, using the extraction facilities available there.
ii. Initial Reformatting and Merging: After creating the raw data extracts from
the various sources, the extracted files from each source are reformatted
and merged into a smaller number of extract files. Just like the extraction
step, it is best to do this step of initial merging of each set of source extracts
on the source platform itself.
iii. Preliminary Data Cleansing: In this step, you verify the extracted data from
each data source for any missing values in individual fields, supply default
values, and perform basic edits. This is another step for the computing
platform of the source system itself.
iv. Transformation and Consolidation: This step comprises all the major data
transformation and integration functions. Usually, you will use transformation
software tools for this purpose. Where is the best place to perform this step?
Obviously, not in any individual legacy platform. You perform this step on the
platform where your staging area resides.
Platform Options – Hybrid Option (Cont.)
v. Validation and Final Quality Check: This step of final validation and
quality check is a strong candidate for the staging area. You will arrange for
this step to happen on that platform.
vi. Creation of Load Images: This step creates load images for individual
database files of the data warehouse repository. This step almost always
occurs in the staging area and, therefore, on the platform where the staging
area resides.
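The staging-area flow in steps ii through vi can be sketched as a small pipeline. This is a minimal illustration only, assuming hypothetical field names, default rules, and validation checks; in practice these steps are performed by ETL tools on the staging platform.

```python
# Minimal sketch of steps ii-vi: merge the per-source extracts, cleanse with
# default values, transform/consolidate, validate, and emit load images.
# All field names and rules are hypothetical.

def merge_extracts(extract_files):
    """Step ii: merge several per-source extract files into one batch."""
    merged = []
    for records in extract_files:
        merged.extend(records)
    return merged

def cleanse(record):
    """Step iii: supply default values for missing fields."""
    record = dict(record)
    if not record.get("region"):
        record["region"] = "UNKNOWN"
    return record

def transform(record):
    """Step iv: consolidate into the warehouse record format."""
    return {"cust_key": record["id"],
            "region": record["region"],
            "sales_amt": float(record["amount"])}

def validate(record):
    """Step v: final quality check before creating load images."""
    return record["sales_amt"] >= 0

def build_load_images(extract_files):
    """Step vi: produce the load images waiting in the staging area."""
    records = [transform(cleanse(r)) for r in merge_extracts(extract_files)]
    return [r for r in records if validate(r)]

images = build_load_images([[{"id": 1, "amount": "10.5", "region": ""}],
                            [{"id": 2, "amount": "-3", "region": "EAST"}]])
# The first record passes with its region defaulted; the negative amount fails.
```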
DATABASE SOFTWARE
• Data warehouse-related add-ons are becoming part of the database offerings.
• The database software that started out for use in operational OLTP systems is
being enhanced to cater to decision support systems.
• Some RDBMS products now include support for the data acquisition area of
the data warehouse.
• Mass loading and retrieval of data from other database systems have become
easier.
• Some vendors have paid special attention to the data transformation function.
• Replication features have been reinforced to assist in bulk refreshes and
incremental loading of the data warehouse.
• Apart from these enhancements, the more important ones relate to load
balancing and query performance. These two features are critical in a data
warehouse. Your data warehouse is query-centric.
• Everything that can be done to improve query performance is most desirable.
• The DBMS vendors are providing parallel processing features to improve
query performance. Let us briefly review the parallel processing options within
the DBMS that can take full advantage of parallel server hardware.
Parallel Processing Options
• Parallel processing options in database software are intended only for
machines with multiple processors.
• Most of the current database software can parallelize a large number of
operations. These operations include the following: mass loading of data, full
table scans, queries with exclusion conditions, queries with grouping, selection
with distinct values, aggregation, sorting, creation of tables using subqueries,
creating and rebuilding indexes, inserting rows into a table from other tables,
enabling constraints, star transformation (an optimization technique when
processing queries against a STAR schema), and so on.
• Let us now examine what happens when a user initiates a query at the
workstation. Each session accesses the database through a server process.
The query is sent to the DBMS and data retrieval takes place from the
database. Data is retrieved and the results are sent back, all under the control
of the dedicated server process. The query dispatcher software is responsible
for splitting the work, distributing the units to be performed among the pool of
available query server processes, and balancing the load. Finally, the results
of the query processes are assembled and returned as a single, consolidated
result set.
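The dispatcher's job of splitting the work, distributing the units, and assembling a consolidated result can be sketched as follows. This is only an illustration of the idea; threads stand in for the pool of query server processes, and the partitioned scan is hypothetical.

```python
# Sketch of a query dispatcher: split the work across a pool of query server
# processes (threads here), then merge the partial results into one set.
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition, predicate):
    """One unit of work: scan a data partition and apply the query filter."""
    return [row for row in partition if predicate(row)]

def dispatch_query(partitions, predicate, workers=4):
    """Fan the scan out across the pool, then assemble the result set."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(scan_partition, partitions,
                            [predicate] * len(partitions))
    return [row for partial in partials for row in partial]

data = [[1, 8, 3], [9, 2], [7, 4]]          # three data partitions
result = dispatch_query(data, lambda x: x > 5)
# One qualifying row per partition, returned in partition order: [8, 9, 7]
```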
Parallel Processing Options (Cont.)
Inter-query Parallelization
• In this method, several server processes handle multiple requests
simultaneously. Multiple queries may be serviced based on your server
configuration and the number of available processors.
• However, inter-query parallelism is limited. Multiple queries are processed
concurrently, but each query is still being processed serially by a single server
process. Suppose a query consists of index read, data read, sort, and join
operations; these operations are carried out in this order. Each operation must
finish before the next one can begin. Parts of the same query do not execute
in parallel. To overcome this limitation, many DBMS vendors have come up
with versions of their products to provide intra-query parallelization.
Intra-query Parallelization
• Using the intra-query parallelization technique, the DBMS splits the query into
the lower-level operations of index read, data read, data join, and data sort.
Each of these basic operations is then executed in parallel, each on its own
processor. The final result set is the consolidation of the intermediary results.
Parallel Processing Options (Cont.)
Three ways a DBMS can provide intra-query parallelization are horizontal
parallelism, vertical parallelism, and a hybrid of the two.
i. Horizontal Parallelism.
• The data is partitioned across multiple disks. Parallel processing occurs within
each single task in the query.
Parallel Processing Options (Cont.)
ii. Vertical Parallelism.
• This kind of parallelism occurs among different tasks, not just a single task in a
query as in the case of horizontal parallelism. All component query operations
are executed in parallel, but in a pipelined manner. This assumes that the
RDBMS has the capability to decompose the query into subtasks; each
subtask has all the operations of index read, data read, join, and sort. Then
each subtask executes on the data in serial fashion.
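The pipelined execution described above can be sketched with two concurrent stages, the first feeding the second as rows are produced. This is a minimal illustration: the stages (data read and sort) and the data are hypothetical, and threads stand in for the parallel query operations.

```python
# Sketch of vertical (pipelined) parallelism: a data-read stage and a sort
# stage run concurrently, connected by a queue, instead of one finishing
# before the other begins.
import queue
import threading

SENTINEL = object()  # marks the end of the row stream

def data_read(rows, out_q):
    """Stage 1: read rows and pass each one downstream as it is produced."""
    for row in rows:
        out_q.put(row)
    out_q.put(SENTINEL)

def sort_stage(in_q, results):
    """Stage 2: collect rows as they arrive, then sort the batch."""
    batch = []
    while (row := in_q.get()) is not SENTINEL:
        batch.append(row)
    results.extend(sorted(batch))

q = queue.Queue()
results = []
reader = threading.Thread(target=data_read, args=([3, 1, 2], q))
sorter = threading.Thread(target=sort_stage, args=(q, results))
reader.start(); sorter.start()
reader.join(); sorter.join()
# results == [1, 2, 3]
```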
• A guiding principle of data warehouse construction is: architecture first, and
then tools. Why is this principle sacred? Why is it not advisable to just buy the
set of tools and then use the tools to build and to deploy your data warehouse?
• The reason is that the tools may not meet the requirements as reflected in the
architecture.
Collection of Tools
• Software tools are available for every architectural component of the data
warehouse.
• Software tools are extremely important in a data warehouse; they cover all the
major functions.
Types of Software Tools
Data Modeling
• Enable developers to create and maintain data models for the source systems
and warehouse target databases. If necessary, data models may be created
for the staging area.
• Provide forward engineering capabilities to generate the database schema.
• Provide reverse engineering capabilities to generate the data model from the
data dictionary entries of existing source databases.
• Provide dimensional modeling capabilities to data designers for creating STAR
schemas.
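The forward engineering capability mentioned above (generating the database schema from a data model) can be sketched as follows. This is a toy illustration, not the behavior of any particular modeling tool; the table and column names are hypothetical.

```python
# Sketch of forward engineering: generate DDL for a hypothetical STAR-schema
# fact table from a simple in-memory data model.

model = {
    "table": "sales_fact",
    "columns": [("date_key", "INTEGER"),
                ("product_key", "INTEGER"),
                ("sales_amt", "DECIMAL(12,2)")],
}

def generate_ddl(model):
    """Emit a CREATE TABLE statement from the model definition."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in model["columns"])
    return f"CREATE TABLE {model['table']} (\n  {cols}\n);"

ddl = generate_ddl(model)
print(ddl)
```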
Data Extraction
• Two primary extraction methods are available: bulk extraction for full refreshes
and change-based replication for incremental loads.
• Tool choices depend on the following factors: source system platforms and
databases, and available built-in extraction and replication facilities in the
source systems.
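The contrast between the two extraction methods can be sketched as follows. The record layout and the `updated_at` change-tracking field are hypothetical conventions, assumed only for illustration.

```python
# Sketch of the two primary extraction methods: bulk extraction for a full
# refresh versus change-based extraction for an incremental load.

def bulk_extract(source_rows):
    """Full refresh: extract every row from the source."""
    return list(source_rows)

def incremental_extract(source_rows, last_run_ts):
    """Incremental load: extract only rows changed since the last run."""
    return [r for r in source_rows if r["updated_at"] > last_run_ts]

rows = [{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 205}]
full = bulk_extract(rows)                 # both rows, for a full refresh
delta = incremental_extract(rows, 200)    # only the row changed after 200
```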
Types of Software Tools (Cont.)
Data Transformation
• Transform extracted data into appropriate formats and data structures.
• Provide default values as specified.
• Major features include field splitting, consolidation, standardization, and de-
duplication.
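The transformation features named above can be sketched in a few lines. The record layout and the rules (splitting a combined name, upper-casing a state code, de-duplicating on the name) are hypothetical examples of what transformation tools do.

```python
# Sketch of three transformation features: field splitting, standardization,
# and de-duplication. All field names and rules are hypothetical.

def split_name(record):
    """Field splitting: break a combined name into first and last fields."""
    first, _, last = record["full_name"].partition(" ")
    return {**record, "first_name": first, "last_name": last}

def standardize(record):
    """Standardization: enforce one agreed representation for codes."""
    return {**record, "state": record["state"].strip().upper()}

def deduplicate(records, key):
    """De-duplication: keep the first record seen for each key value."""
    seen, unique = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            unique.append(r)
    return unique

raw = [{"full_name": "Ann Lee", "state": " ny "},
       {"full_name": "Ann Lee", "state": "NY"}]
clean = deduplicate([standardize(split_name(r)) for r in raw], "full_name")
# clean holds a single record with first_name "Ann" and state "NY"
```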
Data Loading
• Load transformed and consolidated data in the form of load images into the
data warehouse repository.
• Some loaders generate primary keys for the tables being loaded.
• For load images available on the same RDBMS engine as the data
warehouse, pre-coded procedures stored on the database itself may be used
for loading.
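The load step, including loader-generated primary keys, can be sketched against an embedded SQLite engine. The table layout and load images are hypothetical; a production warehouse would use the RDBMS's bulk loader or stored procedures as noted above.

```python
# Sketch of the load step: apply load images to a warehouse table, with a
# surrogate primary key generated by the engine rather than the source data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales_fact (
    sale_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- loader-generated key
    product  TEXT,
    amount   REAL)""")

load_images = [("widget", 10.5), ("gadget", 3.0)]
conn.executemany("INSERT INTO sales_fact (product, amount) VALUES (?, ?)",
                 load_images)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
keys = [k for (k,) in conn.execute("SELECT sale_key FROM sales_fact")]
# Two rows loaded; sale_key values 1 and 2 were generated by the engine.
```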
Types of Software Tools (Cont.)
Data Quality
• Assist in locating and correcting data errors.
• May be used on the data in the staging area or on the source systems directly.
• Help resolve data inconsistencies in load images.
Alert Systems
• Highlight conditions and get the user’s attention based on defined exceptions.
• Provide alerts from the data warehouse database to support strategic
decisions.
• Three basic alert types are: from individual source systems, from integrated
enterprise-wide data warehouses, and from individual data marts.
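A minimal alert check can be sketched as follows. The measures and thresholds are hypothetical; a real alert system would evaluate its defined exceptions against the warehouse database or data marts.

```python
# Sketch of an alert system: flag any measure that crosses a defined
# exception threshold. Measure names and thresholds are hypothetical.

EXCEPTIONS = {"daily_returns": lambda v: v > 100,
              "inventory_level": lambda v: v < 20}

def check_alerts(measures):
    """Return the names of measures that triggered a defined exception."""
    return [name for name, value in measures.items()
            if name in EXCEPTIONS and EXCEPTIONS[name](value)]

alerts = check_alerts({"daily_returns": 150, "inventory_level": 35})
# Only daily_returns exceeds its threshold here.
```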
Types of Software Tools (Cont.)
Middleware and Connectivity
• Transparent access to source systems in heterogeneous environments.
• Transparent access to databases of different types on multiple platforms.
• Tools are moderately expensive but prove to be invaluable for providing
interoperability among the various data warehouse components.