Lecture-7

INFRASTRUCTURE AS THE FOUNDATION FOR DATA WAREHOUSING
Lecture # 07
Instructor: Mr. Sharjeel Ahmed
Slide Elements
• Infrastructure as the Foundation for Data Warehousing
• Infrastructure Supporting Architecture
• Operational Infrastructure
• Physical Infrastructure
• Hardware And Operating System
• Platform Options
• Server Hardware
• Database Software
• Parallel Processing Options
• Selection of the DBMS
• Collection of Tools
INFRASTRUCTURE SUPPORTING ARCHITECTURE
Infrastructure Supporting Architecture
• Data warehouse infrastructure includes all the foundational elements that
enable the architecture to be implemented.

• The infrastructure includes several elements such as server hardware,
operating system, network software, database software, the LAN and WAN,
vendor tools for every architectural component, people, procedures, and
training.
Infrastructure Classification
• The elements of the data warehouse infrastructure may be classified into
two categories: operational infrastructure and physical infrastructure.
• The physical infrastructure is much wider and more fundamental.

Operational Infrastructure
• Operational infrastructure to support each architectural component consists of:
• People
• Procedures
• Training
• Management software
• These are not the people and procedures needed to develop the data
warehouse.
• They are the ones needed to keep the data warehouse going. These
elements are as essential as the hardware and software that keep the data
warehouse running. They support the management of the data warehouse and
maintain its efficiency.
Infrastructure Classification (Cont. )
Physical Infrastructure
• The platform consists of the basic hardware components, the operating
system with its utility software, the network, and the network software. Along
with the overall platform is the set of tools that run on the selected platform to
perform the various functions and services of individual architectural
components.
HARDWARE AND OPERATING SYSTEM
Hardware and Operating System
• Hardware and operating systems make up the computing environment for
your data warehouse.

• All the data extraction, transformation, integration, and staging jobs run on the
selected hardware under the chosen operating system.

• When you transport the consolidated and integrated data from the staging
area to your data warehouse repository, you make use of the server hardware
and the operating system software.

• When the queries are initiated from the client workstations, the server
hardware, in conjunction with the database software, executes the queries and
produces the results.
Guidelines for Hardware Selection
Here are some general guidelines for hardware selection, not entirely specific to
hardware for the data warehouse.

• Scalability: Ensure that your selected hardware can be scaled up as your
data warehouse grows in terms of the number of users, the number of queries,
and the complexity of the queries.

• Support: Vendor support is crucial for hardware maintenance. Make sure that
the support from the hardware vendor is at the highest possible level.

• Vendor Reference: It is important to check vendor references with other sites
using hardware from this vendor. You do not want to be caught with your data
warehouse being down because of hardware malfunctions when the CEO
wants some critical analysis to be completed.

• Vendor Stability: Check on the stability and staying power of the vendor.
Guidelines for OS Selection
Let us quickly consider a few general criteria for the selection of the operating
system. First of all, the operating system must be compatible with the hardware.
A list of criteria follows:

• Scalability: Along with the hardware and database software, the operating
system must be able to support the increase in the number of users and
applications.

• Security: The operating system must provide each client with a secure
environment.

• Reliability: The operating system must be able to protect the environment
from application malfunctions.

• Availability: The computing environment must continue to be available after
abnormal application terminations.
Guidelines for OS Selection (Cont. )
• Preemptive Multitasking: The server hardware must be able to balance the
allocation of time and resources among the multiple tasks. Also, the operating
system must be able to let a higher priority task preempt or interrupt another
task as and when needed.

• Multithreaded approach: The operating system must be able to serve
multiple requests concurrently by distributing threads to multiple processors in
a multiprocessor hardware configuration. This feature is very important
because multiprocessor configurations are the architectures of choice in a data
warehouse environment.

• Memory protection: In a data warehouse environment, large numbers of
queries are common, which means that multiple queries will be executing
concurrently. A memory protection feature in an operating system prevents
one task from violating the memory space of another.
Common Options for Hardware and OS
Let us go through the following list of three common options.
Mainframes:
• Leftover hardware from legacy applications
• Primarily designed for OLTP and not for decision support applications
• Not cost-effective for data warehousing
• Not easily scalable
• Rarely used for data warehousing, except when ample spare resources are
available for smaller data marts
Open System Servers
• UNIX servers, the platform of choice for most data warehouses
• Generally robust
• Adapted for parallel processing
NT Servers
• Support medium-sized data warehouses
• Limited parallel processing capabilities
• Cost-effective for medium-sized and small data warehouses
Platform Options
Recap:
• Let us return to a quick recap of the functions and services of the
architectural components in the three major areas:
• Data Acquisition: data extraction, data transformation, data cleansing, data
integration, and data staging.
• Data Storage: data loading, archiving, and data management.
• Information Delivery: report generation, query processing, and complex
analysis.

Platform Options:
• We will now discuss platform options in terms of the functions in these three
areas.
Platform Options
1. Single Platform Option

• This is the most straightforward and simplest option for implementing the data
warehouse architecture.
• In this option, all functions from the backend data extraction to the front-end
query processing are performed on a single computing platform.
• This was perhaps the earliest approach, when developers were implementing
data warehouses on existing mainframes, minicomputers, or a single
UNIX-based server.
• Because all operations in the data acquisition, data storage, and information
delivery areas take place on the same platform, this option hardly ever
encounters any compatibility or interface problems.
• The data flows smoothly from beginning to end without any
platform-to-platform conversions. No middleware is needed.
• All tools work in a single computing environment.
Platform Options – Hybrid Option

• If your company falls in the category where the legacy platform will
accommodate your data warehouse, then, by all means, take the approach of
a single-platform solution. The single-platform solution, if feasible, is the
easier solution.
• For the rest of us who are not that fortunate, we have to consider other
options. Let us begin with data extraction, the first major operation, and follow
the flow of data until it is consolidated into load images and waiting in the
staging area.

• We will now step through the data flow and examine the platform options.

i. Data Extraction: In any data warehouse, it is best to perform the data
extraction function from each source system on its own computing platform.
If your telephone sales data resides in a minicomputer environment, create
extract files on the minicomputer itself for telephone sales.
Platform Options – Hybrid Option (Cont. )

ii. Initial Reformatting and Merging: After creating the raw data extracts from
the various sources, the extracted files from each source are reformatted
and merged into a smaller number of extract files. Just like the extraction
step, it is best to do this step of initial merging of each set of source extracts
on the source platform itself.

iii. Preliminary Data Cleansing: In this step, you verify the extracted data from
each data source for any missing values in individual fields, supply default
values, and perform basic edits. This is another step for the computing
platform of the source system itself.

iv. Transformation and Consolidation: This step comprises all the major data
transformation and integration functions. Usually, you will use transformation
software tools for this purpose. Where is the best place to perform this step?
Obviously, not in any individual legacy platform. You perform this step on the
platform where your staging area resides.
Platform Options – Hybrid Option (Cont. )

v. Validation and Final Quality Check: This step of final validation and
quality check is a strong candidate for the staging area. You will arrange for
this step to happen on that platform.

vi. Creation of Load Images: This step creates load images for individual
database files of the data warehouse repository. This step almost always
occurs in the staging area and, therefore, on the platform where the staging
area resides.
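
The six steps above can be summarized as a mapping from step to platform, together with a small illustration of the preliminary cleansing step. This is a minimal sketch: the field names and default values in `DEFAULTS` are hypothetical and chosen only for illustration.

```python
# Where each step of the hybrid data flow runs, as described in steps i to vi.
STEP_PLATFORM = {
    "data extraction": "source",
    "initial reformatting and merging": "source",
    "preliminary data cleansing": "source",
    "transformation and consolidation": "staging",
    "validation and final quality check": "staging",
    "creation of load images": "staging",
}


def steps_on(platform):
    """Return the steps that run on the given platform, in flow order."""
    return [step for step, where in STEP_PLATFORM.items() if where == platform]


# Hypothetical default values supplied for missing fields (step iii).
DEFAULTS = {"region": "UNKNOWN", "quantity": 0}


def preliminary_cleanse(record):
    """Return a copy of the record with defaults filled in for missing values."""
    cleaned = dict(record)
    for field, default in DEFAULTS.items():
        if cleaned.get(field) in (None, ""):
            cleaned[field] = default
    return cleaned


source_steps = steps_on("source")
cleaned = preliminary_cleanse({"order_id": 1, "region": "", "quantity": None})
```

Keeping the first three steps on the source platform means only reformatted, partially cleansed extract files have to cross over to the staging platform.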
DATABASE SOFTWARE
Database Software
• Data-warehouse related add-ons are becoming part of the database offerings.
• The database software that started out for use in operational OLTP systems is
being enhanced to cater to decision support systems.
• Some RDBMS products now include support for the data acquisition area of
the data warehouse.
• Mass loading & retrieval of data from other DB systems have become easier.
• Some vendors have paid special attention to the data transformation function.
• Replication features have been reinforced to assist in bulk refreshes and
incremental loading of the data warehouse.
• Apart from these enhancements, the more important ones relate to load
balancing and query performance. These two features are critical in a data
warehouse. Your data warehouse is query-centric.
• Everything that can be done to improve query performance is most desirable.
• The DBMS vendors are providing parallel processing features to improve
query performance. Let us briefly review the parallel processing options within
the DBMS that can take full advantage of parallel server hardware.
Parallel Processing Options
• Parallel processing options in database software are intended only for
machines with multiple processors.
• Most of the current database software can parallelize a large number of
operations. These operations include the following: mass loading of data, full
table scans, queries with exclusion conditions, queries with grouping, selection
with distinct values, aggregation, sorting, creation of tables using sub queries,
creating and rebuilding indexes, inserting rows into a table from other tables,
enabling constraints, star transformation (an optimization technique when
processing queries against a STAR schema), and so on.
• Let us now examine what happens when a user initiates a query at the
workstation. Each session accesses the database through a server process.
The query is sent to the DBMS and data retrieval takes place from the
database. Data is retrieved and the results are sent back, all under the control
of the dedicated server process. The query dispatcher software is responsible
for splitting the work, distributing the units to be performed among the pool of
available query server processes, and balancing the load. Finally, the results
of the query processes are assembled and returned as a single, consolidated
result set.
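
The dispatcher behavior described above (split the work, distribute the units among a pool of query servers, assemble one result set) can be sketched as follows. This is an illustration only, not how any particular DBMS implements it: the names `split_work`, `query_server`, and `dispatch` are hypothetical, a thread pool stands in for the pool of query server processes, and the "query" is a simple filter for amounts above 100.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(rows, units):
    """Split rows into `units` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(rows), units)
    chunks, start = [], 0
    for i in range(units):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def query_server(chunk):
    """One query server's unit of work: select amounts greater than 100."""
    return [amount for amount in chunk if amount > 100]


def dispatch(rows, servers=3):
    """Distribute the chunks among the pool and consolidate the partial results."""
    with ThreadPoolExecutor(max_workers=servers) as pool:
        partial_results = pool.map(query_server, split_work(rows, servers))
        consolidated = []
        for partial in partial_results:
            consolidated.extend(partial)
    return consolidated


amounts = [50, 150, 200, 75, 300, 120, 90]
result_set = dispatch(amounts)
```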
Parallel Processing Options (Cont. )
Inter-query Parallelization
• In this method, several server processes handle multiple requests
simultaneously. Multiple queries may be serviced based on your server
configuration and the number of available processors.
• However, inter-query parallelism is limited. Multiple queries are processed
concurrently, but each query is still being processed serially by a single server
process. Suppose a query consists of index read, data read, sort, and join
operations; these operations are carried out in this order. Each operation must
finish before the next one can begin. Parts of the same query do not execute
in parallel. To overcome this limitation, many DBMS vendors have come up
with versions of their products to provide intra-query parallelization.
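
A minimal sketch of inter-query parallelization, assuming a thread pool as a stand-in for the server processes: several queries are in flight at once, but inside each query the operations still run one after another in a fixed order. The function names and the trace format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Operations of a single query, carried out strictly in this order.
OPERATIONS = ["index read", "data read", "sort", "join"]


def run_query_serially(name):
    """One server process handles one query; each operation waits for the previous one."""
    trace = []
    for operation in OPERATIONS:
        trace.append(f"{name}:{operation}")
    return trace


def inter_query(names, servers=2):
    """Service several queries concurrently, one server per query."""
    with ThreadPoolExecutor(max_workers=servers) as pool:
        return dict(zip(names, pool.map(run_query_serially, names)))


traces = inter_query(["Q1", "Q2", "Q3"])
```

The concurrency here is only between queries; no part of `run_query_serially` is split up, which is exactly the limitation that intra-query parallelization addresses.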
Intra-query Parallelization
• Using the intra-query parallelization technique, the DBMS splits the query into
the lower-level operations of index read, data read, data join, and data sort.
These basic operations are then executed in parallel, each on its own
processor. The final result set is the consolidation of the intermediary results.
Parallel Processing Options (Cont. )
Three ways a DBMS can provide intra-query parallelization
i. Horizontal Parallelism.
• The data is partitioned across multiple disks. Parallel processing occurs within
each single task in the query.
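
As an illustration of horizontal parallelism, the sketch below runs one task (a data read with a filter) on every partition at the same time and then combines the partial results. The partition names and rows are made up for the example, and a thread pool stands in for multiple processors.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions, standing in for data spread across multiple disks.
PARTITIONS = {
    "disk_0": [("east", 100), ("west", 250)],
    "disk_1": [("east", 300), ("north", 50)],
    "disk_2": [("west", 75)],
}


def scan_partition(rows, region):
    """The single task being parallelized: read one partition and keep matching rows."""
    return [amount for row_region, amount in rows if row_region == region]


def parallel_scan(region):
    """Run the same scan on every partition concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        futures = [
            pool.submit(scan_partition, rows, region)
            for rows in PARTITIONS.values()
        ]
        return sorted(amount for future in futures for amount in future.result())


east_amounts = parallel_scan("east")
```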
Parallel Processing Options (Cont. )
ii. Vertical Parallelism.
• This kind of parallelism occurs among different tasks, not just a single task in a
query as in the case of horizontal parallelism. All component query operations
are executed in parallel, but in a pipelined manner. This assumes that the
RDBMS has the capability to decompose the query into subtasks; each
subtask has all the operations of index read, data read, join, and sort. Then
each subtask executes on the data in serial fashion.
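
The pipelined execution mentioned above can be sketched with one thread per operation, connected by queues, so that a record finished by one operation is handed straight to the next while earlier operations keep working on later records. This is an assumption-laden illustration, not a DBMS implementation; `stage`, `run_pipeline`, and the two toy operations are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the record stream


def stage(func, inbox, outbox):
    """Apply one operation to each incoming record and pass it on immediately."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(func(item))


def run_pipeline(records, funcs):
    """Run the operations in `funcs` as concurrent pipeline stages."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()
    for record in records:
        queues[0].put(record)
    queues[0].put(SENTINEL)

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


# Toy operations standing in for "data read" and "join".
pipeline_output = run_pipeline([1, 2, 3], [lambda key: key * 10, lambda value: value + 1])
```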

iii. Hybrid Method.
• In this method, the query decomposer partitions the query both horizontally
and vertically. Naturally, this approach produces the best results. You will
realize the greatest utilization of resources, optimal performance, and high
scalability.
Selection of the DBMS
• Selection of the DBMS is most crucial. Your choice of the DBMS must match
the selected server hardware.
• Apart from the criteria that the selected DBMS must have load balancing and
parallel processing options, the other key features listed below must be
considered when selecting the DBMS for your data warehouse.
• Query governor—to anticipate and abort runaway queries
• Query optimizer—to parse and optimize user queries
• Query management—to balance the execution of different types of queries
• Load utility—for high-performance data loading, recovery, and restart
• Metadata management—with an active data catalog or dictionary
• Scalability—in terms of both number of users and data volumes
• Extensibility—having hybrid extensions to OLAP databases
• Portability—across platforms
• Query tool APIs—for tools from leading vendors
• Administration—providing support for all DBA functions
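
To make the query governor item concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It only aborts a query that exceeds a time budget; it does not anticipate cost in advance as a full governor would. The function name `run_with_governor`, the time budget, and the runaway query are assumptions made for the example.

```python
import sqlite3
import time


def run_with_governor(conn, sql, max_seconds):
    """Run a query, aborting it if it exceeds max_seconds. Returns None when aborted."""
    deadline = time.monotonic() + max_seconds

    def over_budget():
        # A non-zero return value tells SQLite to interrupt the running query.
        return 1 if time.monotonic() > deadline else 0

    # Call the handler every 1000 SQLite virtual machine instructions.
    conn.set_progress_handler(over_budget, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        if time.monotonic() > deadline:
            return None  # aborted by the governor
        raise  # a genuine SQL error, not a runaway query
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler


conn = sqlite3.connect(":memory:")
# A deliberately unbounded recursive query, used here as the runaway case.
RUNAWAY = (
    "WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
    "SELECT count(*) FROM c"
)
quick = run_with_governor(conn, "SELECT 1 + 1", max_seconds=1.0)
aborted = run_with_governor(conn, RUNAWAY, max_seconds=0.05)
```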
COLLECTION OF TOOLS
Architecture First, Then Tools
• The title of this subsection simply means this: ignore the tools; design the
architecture first; then, and only then, choose the tools to match the functions
and services stipulated for the architectural components.

• Do the architecture first; select the tools later.

• Why is this principle sacred? Why is it not advisable to just buy the set of tools
and then use the tools to build and to deploy your data warehouse?

• The reason is that a tool chosen in advance may not meet the requirements
reflected in the architecture.
Collection of Tools
• Software tools are available for every architectural component of the data
warehouse.
• Software tools are extremely important in a data warehouse. As you have
seen from this figure, tools cover all the major functions.
Types of Software Tools
Data Modeling
• Enable developers to create and maintain data models for the source systems
and warehouse target databases. If necessary, data models may be created
for the staging area.
• Provide forward engineering capabilities to generate the database schema.
• Provide reverse engineering capabilities to generate the data model from the
data dictionary entries of existing source databases.
• Provide dimensional modeling capabilities to data designers for creating STAR
schemas.

Data Extraction
• Two primary extraction methods are available: bulk extraction for full refreshes
and change-based replication for incremental loads.
• Tool choices depend on the following factors: source system platforms and
databases, and available built-in extraction and duplication facilities in the
source systems.
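
The difference between the two extraction methods can be sketched as follows. Note the simplification: change-based replication in real tools typically relies on transaction logs or triggers, whereas this example detects changes with a hypothetical `updated_at` timestamp column. The rows and function names are illustrative only.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp.
SOURCE_ROWS = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]


def bulk_extract(rows):
    """Full refresh: take every row from the source."""
    return list(rows)


def incremental_extract(rows, last_extract):
    """Incremental load: take only rows changed since the previous extract."""
    return [row for row in rows if row["updated_at"] > last_extract]


full = bulk_extract(SOURCE_ROWS)
changed = incremental_extract(SOURCE_ROWS, last_extract=datetime(2024, 1, 4))
```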
Types of Software Tools (Cont. )
Data Transformation
• Transform extracted data into appropriate formats and data structures.
• Provide default values as specified.
• Major features include field splitting, consolidation, standardization, and
deduplication.
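
A small sketch of three of these features (field splitting, standardization, and removal of duplicates) applied to customer records. The record layout, the `STATE_CODES` lookup, and the function names are hypothetical; commercial transformation tools are far more configurable.

```python
def split_name(full_name):
    """Field splitting: break a full name into first and last name."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()


# Hypothetical lookup used to standardize state values.
STATE_CODES = {"new york": "NY", "n.y.": "NY", "ny": "NY"}


def standardize_state(value):
    """Standardization: map known variants to one code, otherwise upper-case."""
    cleaned = value.strip()
    return STATE_CODES.get(cleaned.lower(), cleaned.upper())


def transform(records):
    """Split, standardize, and drop records that duplicate an earlier one."""
    seen, output = set(), []
    for record in records:
        first, last = split_name(record["name"])
        row = {
            "first_name": first,
            "last_name": last,
            "state": standardize_state(record["state"]),
        }
        key = (first.lower(), last.lower(), row["state"])
        if key not in seen:
            seen.add(key)
            output.append(row)
    return output


transformed = transform([
    {"name": "Ada Lovelace", "state": "New York"},
    {"name": "ada lovelace ", "state": "N.Y."},
    {"name": "Alan Turing", "state": "nj"},
])
```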

Data Loading
• Load transformed and consolidated data in the form of load images into the
data warehouse repository.
• Some loaders generate primary keys for the tables being loaded.
• For load images available on the same RDBMS engine as the data
warehouse, pre-coded procedures stored on the database itself may be used
for loading.
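
The point about loaders generating primary keys can be illustrated with a short sketch using `sqlite3` as a stand-in for the warehouse RDBMS. The table `product_dim`, its columns, and the function `load_with_keys` are hypothetical names chosen for the example.

```python
import itertools
import sqlite3


def load_with_keys(conn, load_image, start_key=1):
    """Load rows into product_dim, generating a sequential primary key for each."""
    keys = itertools.count(start_key)
    rows = [(next(keys), r["product_code"], r["name"]) for r in load_image]
    with conn:  # commit on success, roll back if the load fails
        conn.executemany(
            "INSERT INTO product_dim (product_key, product_code, name) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_dim ("
    "product_key INTEGER PRIMARY KEY, product_code TEXT, name TEXT)"
)
load_image = [
    {"product_code": "BK-01", "name": "Atlas"},
    {"product_code": "BK-02", "name": "Dictionary"},
]
loaded = load_with_keys(conn, load_image)
stored = conn.execute(
    "SELECT product_key, product_code FROM product_dim ORDER BY product_key"
).fetchall()
```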
Types of Software Tools (Cont. )
Data Quality
• Assist in locating and correcting data errors.
• May be used on the data in the staging area or on the source systems directly.
• Help resolve data inconsistencies in load images.

Queries and Reports
• Allow users to produce canned, graphic-intensive, sophisticated reports.
• Help users to formulate and run queries.
• Two main classifications are report writers and report servers.
Types of Software Tools (Cont. )
Online Analytical Processing (OLAP)
• Allow users to run complex dimensional queries.
• Enable users to generate canned queries.
• Two categories of online analytical processing are multidimensional online
analytical processing (MOLAP) and relational online analytical processing
(ROLAP). MOLAP works with proprietary multidimensional databases that
receive data feeds from the main data warehouse. ROLAP provides online
analytical processing capabilities from the relational database of the data
warehouse itself.
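
In the ROLAP case, a dimensional query is ultimately plain SQL against the warehouse's relational tables. The sketch below assumes a tiny star schema with one fact table and one dimension table in an in-memory `sqlite3` database; the table names, columns, and data are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sales_fact (product_key INTEGER, quarter TEXT, amount REAL);
INSERT INTO product_dim VALUES (1, 'Books'), (2, 'Music');
INSERT INTO sales_fact VALUES
    (1, 'Q1', 100), (1, 'Q2', 150), (2, 'Q1', 80), (2, 'Q1', 20);
""")


def sales_by_category_and_quarter(connection):
    """Aggregate the fact table along two dimensions: product category and quarter."""
    return connection.execute(
        """
        SELECT d.category, f.quarter, SUM(f.amount)
        FROM sales_fact AS f
        JOIN product_dim AS d ON d.product_key = f.product_key
        GROUP BY d.category, f.quarter
        ORDER BY d.category, f.quarter
        """
    ).fetchall()


cube_slice = sales_by_category_and_quarter(conn)
```

A MOLAP product would instead answer the same question from its own multidimensional store, which is fed from the warehouse.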

Alert Systems
• Highlight defined exceptions and draw the user's attention to them.
• Provide alerts from the data warehouse database to support strategic
decisions.
• Three basic alert types are: from individual source systems, from integrated
enterprise-wide data warehouses, and from individual data marts.
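
An alert based on defined exceptions can be as simple as comparing measures against agreed limits. In this sketch the metric names, the limits in `rules`, and the message format are all hypothetical.

```python
def find_alerts(metrics, rules):
    """Return a message for every metric that falls outside its allowed range."""
    alerts = []
    for name, (low, high) in rules.items():
        value = metrics.get(name)
        if value is not None and not (low <= value <= high):
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts


# Hypothetical exception definitions: allowed (low, high) range per metric.
rules = {"daily_sales": (1000, 50000), "return_rate": (0.0, 0.05)}
metrics = {"daily_sales": 800, "return_rate": 0.02}
alerts = find_alerts(metrics, rules)
```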
Types of Software Tools (Cont. )
Middleware and Connectivity
• Provide transparent access to source systems in heterogeneous environments.
• Provide transparent access to databases of different types on multiple platforms.
• Tools are moderately expensive but prove to be invaluable for providing
interoperability among the various data warehouse components.

Data Warehouse Management
• Assist data warehouse administrators in day-to-day management.
• Some tools focus on the load process and track load histories.
• Other tools track types and number of user queries.
