Security, Backup, Recovery, Tuning, Testing of Data Mining and Warehousing

This document discusses various aspects of testing a data warehouse system. It describes three levels of testing: unit testing tests individual components, integration testing brings components together to test integration, and system testing tests the full system together. Key aspects covered include testing the backup recovery strategy by simulating media failure or data loss scenarios.

Uploaded by

deepika2804

School of Computing Science and Engineering

Course Code : Course Name: Data Warehousing and Data Mining

UNIT V
Data Warehousing and Data Mining - Data Visualization and Overall Perspective

Name of the Faculty: Ms. Deepika Sherawat Program Name: B.Tech.


Data Warehousing - Security

Security features affect the performance of the data warehouse, so it is important to
determine the security requirements as early as possible. It is difficult to add security features after
the data warehouse has gone live.

The following activities are affected by security measures −


● User access
● Data load
● Data movement
● Query generation
User Access

We need to first classify the data and then classify the users on the basis of the data they can access.

Data Classification
The following two approaches can be used to classify the data −
● Data can be classified according to its sensitivity. Highly sensitive data is classified as highly restricted, and less
sensitive data is classified as less restricted.
● Data can also be classified according to job function. This restriction allows only specific users to view particular
data. Here we restrict users to viewing only the part of the data that they are interested in and responsible
for.
User classification
The following approaches can be used to classify the users −
● Users can be classified as per the hierarchy of users in an organization, i.e., users can be classified by departments,
sections, groups, and so on.
● Users can also be classified according to their role, with people grouped across departments based on their role.
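
As a rough illustration of how the data and user classifications above can work together, the sketch below gives each table a sensitivity level and each role a clearance plus a set of tables tied to its job function. The table names, role names, and levels are hypothetical and chosen only for this example.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Data classification by sensitivity (higher value = more restricted)."""
    PUBLIC = 1
    INTERNAL = 2
    HIGHLY_RESTRICTED = 3


# Hypothetical data classification: table name -> sensitivity level.
TABLE_SENSITIVITY = {
    "sales_summary": Sensitivity.PUBLIC,
    "customer_orders": Sensitivity.INTERNAL,
    "payroll": Sensitivity.HIGHLY_RESTRICTED,
}

# Hypothetical user classification by role:
# role -> (clearance level, tables that belong to the role's job function).
ROLE_ACCESS = {
    "analyst": (Sensitivity.INTERNAL, {"sales_summary", "customer_orders"}),
    "hr_manager": (Sensitivity.HIGHLY_RESTRICTED, {"payroll"}),
}


def can_access(role: str, table: str) -> bool:
    """Allow access only if the table belongs to the role's job function
    and the role's clearance covers the table's sensitivity."""
    if role not in ROLE_ACCESS or table not in TABLE_SENSITIVITY:
        return False  # unknown roles or tables are denied by default
    clearance, job_tables = ROLE_ACCESS[role]
    return table in job_tables and clearance >= TABLE_SENSITIVITY[table]
```

Denying unknown roles and tables by default is a design choice of this sketch; a real warehouse would normally rely on the access controls of its RDBMS.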
Audit Requirements
Auditing is a subset of security and a costly activity. It can place a heavy overhead on the system. To complete an audit
in time, we require more hardware; it is therefore recommended that auditing be switched off wherever
possible. Audit requirements can be categorized as follows −
● Connections
● Disconnections
● Data access
● Data change
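
Because auditing is costly, one option is to record only the categories that are actually required. The sketch below is a minimal in-memory illustration of that idea; the class name `AuditLog`, the category strings, and the record layout are assumptions made for this example, not part of any particular RDBMS.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four audit categories listed above, as hypothetical string labels.
AUDIT_CATEGORIES = {"connection", "disconnection", "data_access", "data_change"}


@dataclass
class AuditLog:
    """Records audit events, but only for the categories that are enabled."""
    enabled: set = field(default_factory=set)
    records: list = field(default_factory=list)

    def record(self, category: str, user: str, detail: str = "") -> bool:
        """Store the event if its category is enabled; return True if stored."""
        if category not in AUDIT_CATEGORIES:
            raise ValueError(f"unknown audit category: {category}")
        if category not in self.enabled:
            return False  # auditing is switched off for this category
        self.records.append((datetime.now(timezone.utc), category, user, detail))
        return True
```

For example, `AuditLog(enabled={"data_change"})` would ignore connections and disconnections and keep only data changes.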

Network Requirements
Network security is as important as any other aspect of security, and the network security requirement cannot be ignored. We need to
consider the following issues −
● Is it necessary to encrypt data before transferring it to the data warehouse?
● Are there restrictions on which network routes the data can take?
Data Movement

There exist potential security implications while moving the data. Suppose we need to transfer some restricted data as a
flat file to be loaded. When the data is loaded into the data warehouse, the following questions are raised −
● Where is the flat file stored?
● Who has access to that disk space?
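
One narrow way to address the question of who can read a flat file is to create it with owner-only permissions. The sketch below assumes a POSIX file system; the function names are hypothetical, and it does not cover encryption or the security of the disk itself.

```python
import os
import stat


def write_restricted_flat_file(path: str, rows: list[str]) -> None:
    """Write rows to a new flat file that only the owner can read or write."""
    # O_EXCL makes creation fail if the file already exists, so an existing
    # file with looser permissions is never silently reused.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(row + "\n")


def is_owner_only(path: str) -> bool:
    """Return True if neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```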
If we talk about the backup of these flat files, the following questions are raised −
● Do you back up encrypted or decrypted versions?
● Do these backups need to be made to special tapes that are stored separately?
● Who has access to these tapes?
Other forms of data movement, such as query result sets, also need to be considered. The questions raised while creating
a temporary table are as follows −
● Where is that temporary table to be held?
● How do you make such a table visible?
Documentation

The audit and security requirements need to be properly documented. This documentation is treated as part of the justification
for the security measures. It can contain all the information gathered from −
● Data classification
● User classification
● Network requirements
● Data movement and storage requirements
● All auditable actions
Data Warehousing - Backup

Backup Terminologies:
● Complete backup − It backs up the entire database at the same time. This backup includes all the database files,
control files, and journal files.
● Partial backup − As the name suggests, it does not create a complete backup of the database. Partial backups are very
useful in large databases because they allow a strategy whereby various parts of the database are backed up in a
round-robin fashion on a day-to-day basis, so that the whole database is effectively backed up once a week.
● Cold backup − Cold backup is taken while the database is completely shut down. In a multi-instance environment, all
the instances should be shut down.
● Hot backup − Hot backup is taken when the database engine is up and running. The requirements of hot backup
vary from RDBMS to RDBMS.
● Online backup − It is quite similar to hot backup.
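
The round-robin strategy described for partial backups can be sketched as follows. This is only an illustration under assumed names: `round_robin_backup_plan` and the part labels are hypothetical, and a seven-day cycle is assumed so that every part is backed up once a week.

```python
def round_robin_backup_plan(parts: list[str], days: int = 7) -> dict[int, list[str]]:
    """Assign each database part to one day of the cycle, round-robin.

    Day numbers run from 0 to days - 1. After one full cycle every part
    has been backed up exactly once.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    plan: dict[int, list[str]] = {day: [] for day in range(days)}
    for index, part in enumerate(parts):
        plan[index % days].append(part)
    return plan
```

With ten parts and the default seven-day cycle, days 0 to 2 each back up two parts and the remaining days back up one part each.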
Hardware Backup
It is important to decide which hardware to use for the backup. The speed of the backup and restore depends
on the hardware being used, how the hardware is connected, the bandwidth of the network, the backup software, and the speed
of the server's I/O system. The main hardware choices are as follows −
● Tape Technology
● Disk Backups
Software Backups
Software tools are available to help with the backup process, and they come as packages. These tools
not only take backups but can also manage and control backup strategies. Many software packages are
available on the market, including the following.

Package Name     Vendor
Networker        Legato
ADSM             IBM
Epoch            Epoch Systems
Omniback II      HP
Alexandria       Sequent
Data Warehousing - Tuning

A data warehouse keeps evolving, and it is unpredictable what queries users will pose in the future. Therefore, it
is difficult to tune a data warehouse system.

Difficulties in Data Warehouse Tuning


Tuning a data warehouse is a difficult procedure due to the following reasons −
● A data warehouse is dynamic; it never remains constant.
● It is very difficult to predict what queries users will pose in the future.
● Business requirements change with time.
● Users and their profiles keep changing.
● The user can switch from one group to another.
● The data load on the warehouse also changes with time.
Note − It is very important to have complete knowledge of the data warehouse.
Performance Assessment
Here is a list of objective measures of performance −
● Average query response time
● Scan rates
● Time used per query
● Memory usage per process
● I/O throughput rates
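
Some of the measures above can be computed from a simple log of query runs. The sketch below assumes a hypothetical `QueryRun` record with elapsed seconds, rows scanned, and bytes read; real systems would take these figures from the monitoring tools of the RDBMS.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QueryRun:
    seconds: float      # elapsed time of the query
    rows_scanned: int   # rows read while answering the query
    bytes_read: int     # bytes read from disk for the query


def summarize(runs: list[QueryRun]) -> dict[str, float]:
    """Compute average response time, scan rate, and I/O throughput."""
    if not runs:
        raise ValueError("at least one query run is required")
    total_seconds = sum(run.seconds for run in runs)
    if total_seconds <= 0:
        raise ValueError("total elapsed time must be positive")
    return {
        "avg_response_seconds": mean(run.seconds for run in runs),
        "scan_rate_rows_per_second": sum(r.rows_scanned for r in runs) / total_seconds,
        "io_throughput_bytes_per_second": sum(r.bytes_read for r in runs) / total_seconds,
    }
```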

Integrity Checks
Integrity checking strongly affects the performance of the load. The following points should be kept in mind −
● Integrity checks need to be limited because they require heavy processing power.
● Integrity checks should be applied on the source system to avoid degrading the performance of the data load.
Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until data load is complete. This is the entry point
into the system.
There are various approaches to tuning the data load, which are discussed below −
● The very common approach is to insert data using the SQL Layer. In this approach, normal checks and constraints
need to be performed. When the data is inserted into the table, the code will run to check for enough space to insert
the data. If sufficient space is not available, then more space may have to be allocated to these tables. These
checks take time to perform and are costly to CPU.
● The second approach is to bypass all these checks and constraints and place the data directly into the preformatted
blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with
whole blocks of data. This can lead to some space wastage.
● The third approach is to maintain the indexes while loading data into a table that already contains
data.
● The fourth approach is to drop the indexes before loading data into a table that already contains data, and recreate them
when the data load is complete. The choice between the third and the fourth approach depends on how much data
is already loaded and how many indexes need to be rebuilt.
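
The fourth approach can be sketched with Python's built-in `sqlite3` module. SQLite is used here only because it ships with Python; the table `sales` and the index `idx_sales_region` are hypothetical, and a production warehouse would use the bulk-load facilities of its own RDBMS.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with one index, standing in for a warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
    return conn


def bulk_load_with_index_rebuild(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Drop the index, load the rows, then recreate the index."""
    # 'with conn' commits on success and rolls back pending inserts on error.
    with conn:
        conn.execute("DROP INDEX IF EXISTS idx_sales_region")
        conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
```

Whether this beats maintaining the index during the load depends, as noted above, on how much data the table already holds and how many indexes must be rebuilt.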
Data Warehousing - Testing

Testing is very important for data warehouse systems to make them work correctly and efficiently. There are three basic
levels of testing performed on a data warehouse −
● Unit testing
● Integration testing
● System testing

Unit Testing
● In unit testing, each component is separately tested.
● Each module (i.e., procedure, program, SQL script, or Unix shell script) is tested.
● This test is performed by the developer.
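
A unit test for a single component might look like the sketch below, written with Python's standard `unittest` module. The function `normalize_region` is a hypothetical transformation step invented for this example.

```python
import unittest


def normalize_region(raw: str) -> str:
    """Hypothetical load-time transformation: trim and upper-case a region code."""
    cleaned = raw.strip().upper()
    if not cleaned:
        raise ValueError("region code must not be empty")
    return cleaned


class NormalizeRegionTest(unittest.TestCase):
    """Tests the component in isolation, without any database or other module."""

    def test_trims_and_upper_cases(self):
        self.assertEqual(normalize_region("  north "), "NORTH")

    def test_rejects_empty_input(self):
        with self.assertRaises(ValueError):
            normalize_region("   ")
```

If the block is saved as a module, the tests can be run with `python -m unittest` followed by the module name.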

Integration Testing
● In integration testing, the various modules of the application are brought together and then tested against a
number of inputs.
● It is performed to test whether the various components work correctly together after integration.
System Testing
● In system testing, the whole data warehouse application is tested together.
● The purpose of system testing is to check whether the entire system works correctly together or not.
● System testing is performed by the testing team.
● Since the size of the whole data warehouse is very large, it is usually possible to perform only minimal system testing
before the test plan can be enacted.

Testing Backup Recovery


Testing the backup recovery strategy is extremely important. Here is the list of scenarios for which this testing is needed −
● Media failure
● Loss or damage of table space or data file
● Loss or damage of redo log file
● Loss or damage of control file
● Instance failure
● Loss or damage of archive file
● Loss or damage of table
● Failure during data movement
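
One of these scenarios, loss of a data file, can be rehearsed on a small scale with Python's built-in `sqlite3` module, whose `Connection.backup` method copies one database into another. The sketch below is only a toy stand-in for a real recovery test: the file names and the `facts` table are hypothetical, and a real warehouse would use the backup and restore tools of its own RDBMS.

```python
import os
import sqlite3


def backup_database(source_path: str, target_path: str) -> None:
    """Copy the database at source_path into the database at target_path."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


def simulate_loss_and_restore(workdir: str) -> int:
    """Create a database, back it up, delete the data file, restore it,
    and return the number of rows found after the restore."""
    db_path = os.path.join(workdir, "warehouse.db")
    backup_path = os.path.join(workdir, "warehouse.bak")

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany("INSERT INTO facts (value) VALUES (?)", [("a",), ("b",), ("c",)])
    conn.commit()
    conn.close()

    backup_database(db_path, backup_path)
    os.remove(db_path)                      # simulated loss of the data file
    backup_database(backup_path, db_path)   # restore by copying the backup back

    restored = sqlite3.connect(db_path)
    try:
        return restored.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
    finally:
        restored.close()
```

The test passes only if the restored database holds the same three rows that were present before the simulated loss.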
Testing Operational Environment
There are a number of aspects that need to be tested. These aspects are listed below.
● Security − A separate security document is required for security testing. This document contains a list of disallowed
operations and the tests devised for each.
● Scheduler − Scheduling software is required to control the daily operations of a data warehouse. It needs to be
tested during system testing. The scheduling software requires an interface with the data warehouse, which will
need the scheduler to control overnight processing and the management of aggregations.
● Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be
performed multiple times with different settings.
● Management Tools − All the management tools must be tested during system testing.

Testing the Database


The database is tested in the following three ways −
● Testing the database manager and monitoring tools − To test the database manager and the monitoring tools, they
should be used in the creation, running, and management of a test database.
● Testing database features − Here is the list of features that we have to test −
○ Querying in parallel
○ Creating indexes in parallel
○ Loading data in parallel
● Testing database performance − Query execution plays a very important role in data warehouse performance.
Testing the Application
● All the managers should be integrated correctly and work together to ensure that the end-to-end load, index,
aggregate, and query operations work as expected.
● Each function of each manager should work correctly.
● It is also necessary to test the application over a period of time.
● Week-end and month-end tasks should also be tested.
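
Testing over a period of time usually means checking which tasks fall due on which dates. The sketch below shows one way to decide this with the standard library; the task names are hypothetical, and Sunday is assumed to be the last day of the week.

```python
import calendar
from datetime import date


def is_week_end(day: date) -> bool:
    """True on the last day of the week (Sunday is assumed here)."""
    return day.weekday() == 6


def is_month_end(day: date) -> bool:
    """True on the last calendar day of the month, including leap-year February."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.day == last_day


def tasks_due(day: date) -> list[str]:
    """Return the hypothetical task names that should run on the given date."""
    tasks = ["daily_load"]
    if is_week_end(day):
        tasks.append("weekly_aggregation")
    if is_month_end(day):
        tasks.append("monthly_aggregation")
    return tasks
```

A test plan can feed such a function a range of dates, including leap days and dates where a week end and a month end coincide, instead of waiting for the calendar.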
