Security, Backup, Recovery, Tuning, Testing of Data Mining and Warehousing
UNIT V
Data Warehousing and Data Mining - Data
Visualization and Overall Perspective
Security features affect the performance of the data warehouse; therefore, it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live.
We need to first classify the data and then classify the users on the basis of the data they can access.
Data Classification
The following two approaches can be used to classify the data −
● Data can be classified according to its sensitivity. Highly sensitive data is classified as highly restricted, while less sensitive data is classified as less restricted.
● Data can also be classified according to job function. This restriction allows only specific users to view particular data: each user can see only the part of the data that they are interested in and are responsible for.
User classification
The following approaches can be used to classify the users −
● Users can be classified as per the hierarchy of users in an organization, i.e., users can be classified by departments,
sections, groups, and so on.
● Users can also be classified according to their role, with people grouped across departments based on their role.
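The two classifications above come together in an access check: a user's role determines the highest sensitivity level of data they may view. A minimal sketch of such a check, assuming hypothetical sensitivity levels and role-to-clearance mappings (all names are illustrative):

```python
# Combine data classification (by sensitivity) with user classification
# (by role). Every level and role below is a made-up example.
SENSITIVITY = {"public": 0, "internal": 1, "restricted": 2, "highly_restricted": 3}

# Each role is granted a maximum sensitivity level it may read.
ROLE_CLEARANCE = {
    "analyst": "internal",
    "finance_manager": "restricted",
    "security_officer": "highly_restricted",
}

def can_access(role: str, data_class: str) -> bool:
    """Return True if the role's clearance covers the data's sensitivity."""
    clearance = ROLE_CLEARANCE.get(role, "public")
    return SENSITIVITY[clearance] >= SENSITIVITY[data_class]
```

In a real warehouse this decision would be enforced by the database's own privileges (views, grants) rather than application code, but the comparison of user clearance against data classification is the same.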
Audit Requirements
Auditing is a subset of security and a costly activity that can cause heavy overheads on the system. Completing an audit in time requires more hardware; therefore, it is recommended that auditing be switched off wherever possible. Audit requirements can be categorized as follows −
● Connections
● Disconnections
● Data access
● Data change
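The four audit categories above can be captured by a small event-recording routine. A minimal sketch, assuming hypothetical event names and record fields (a real audit trail would be written to protected, append-only storage):

```python
import time

# The four audit categories listed above; anything else is rejected.
AUDITED_EVENTS = {"connect", "disconnect", "data_access", "data_change"}

def audit(event: str, user: str, detail: str = "") -> dict:
    """Record one auditable action as a timestamped record."""
    if event not in AUDITED_EVENTS:
        raise ValueError(f"not an audited event: {event}")
    # In a real system this record would be appended to a secured audit log.
    return {"ts": time.time(), "event": event, "user": user, "detail": detail}

trail = [
    audit("connect", "alice"),
    audit("data_access", "alice", "SELECT on sales_fact"),
    audit("disconnect", "alice"),
]
```

Because every audited action pays this recording cost, restricting `AUDITED_EVENTS` to what is genuinely required is exactly the "switch auditing off wherever possible" advice in code form.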
Network Requirements
Network security is as important as any other aspect of security and cannot be ignored. We need to consider the following issues −
● Is it necessary to encrypt data before transferring it to the data warehouse?
● Are there restrictions on which network routes the data can take?
Data Movement
There are potential security implications when moving data. Suppose we need to transfer some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the following questions arise −
● Where is the flat file stored?
● Who has access to that disk space?
If we consider the backup of these flat files, the following questions arise −
● Do you backup encrypted or decrypted versions?
● Do these backups need to be made to special tapes that are stored separately?
● Who has access to these tapes?
Other forms of data movement, such as query result sets, also need to be considered. The questions raised when creating a temporary table are as follows −
● Where is that temporary table to be held?
● How do you make such a table visible?
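One common answer to both questions is a session-scoped temporary table: the database holds it in the creating session's private workspace and makes it visible only to that session. A minimal sketch using SQLite (the principle is similar in most RDBMSs; table names are illustrative):

```python
import sqlite3, tempfile, os

# Two sessions (connections) open the same database file.
path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
conn_a = sqlite3.connect(path)
conn_b = sqlite3.connect(path)

# A regular table is visible to every session that opens the database...
conn_a.execute("CREATE TABLE sales (id INTEGER)")
conn_a.commit()

# ...but a TEMP table lives only in the session that created it.
conn_a.execute("CREATE TEMP TABLE query_result (id INTEGER, total REAL)")
conn_a.execute("INSERT INTO query_result VALUES (1, 99.5)")

def visible(conn, name):
    """Check whether `name` resolves to a table in this session."""
    try:
        conn.execute(f"SELECT * FROM {name}")
        return True
    except sqlite3.OperationalError:
        return False
```

Here `visible(conn_a, "query_result")` is true while `visible(conn_b, "query_result")` is false, so restricted query results never become readable by other sessions.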
Documentation
The audit and security requirements need to be properly documented, as this documentation will be treated as part of the justification for the system. The document can contain all the information gathered from −
● Data classification
● User classification
● Network requirements
● Data movement and storage requirements
● All auditable actions
Data Warehousing - Backup
Backup Terminology:
● Complete backup − It backs up the entire database at the same time. This backup includes all the database files,
control files, and journal files.
● Partial backup − As the name suggests, it does not create a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-to-day basis, so that the whole database is backed up effectively once a week.
● Cold backup − A cold backup is taken while the database is completely shut down. In a multi-instance environment, all the instances should be shut down.
● Hot backup − A hot backup is taken while the database engine is up and running. The requirements of a hot backup vary from RDBMS to RDBMS.
● Online backup − It is quite similar to hot backup.
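The round-robin strategy behind partial backups can be sketched as a simple schedule: split the warehouse tables into groups and back up one group per day, so the whole database is covered once over the cycle. A minimal sketch, with illustrative table names:

```python
# Hypothetical warehouse tables to be covered by a weekly partial-backup cycle.
TABLES = ["sales_fact", "customer_dim", "product_dim", "time_dim",
          "store_dim", "promo_dim", "inventory_fact"]

def backup_group(day_of_week: int, tables=TABLES, cycle_days: int = 7):
    """Return the tables scheduled for backup on the given day (0 = Monday)."""
    day = day_of_week % cycle_days
    return [t for i, t in enumerate(tables) if i % cycle_days == day]
```

Over seven days every table appears in exactly one group, which is the "whole database backed up effectively once a week" property described above.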
Hardware Backup
It is important to decide which hardware to use for the backup. The speed of processing the backup and restore depends
on the hardware being used, how the hardware is connected, bandwidth of the network, backup software, and the speed
of server's I/O system. Here we will discuss some of the hardware choices that are available and their pros and cons.
These choices are as follows −
● Tape Technology
● Disk Backups
Software Backups
There are software tools available that help in the backup process. These tools come as packages that not only take backups but also effectively manage and control the backup strategies. Many such packages are available in the market, for example −
● Networker − Legato
● ADSM − IBM
● Omniback II − HP
● Alexandria − Sequent
Data Warehousing - Tuning
A data warehouse keeps evolving, and it is unpredictable what queries the users are going to pose in the future. Therefore it becomes more difficult to tune a data warehouse system.
Integrity Checks
Integrity checking highly affects the performance of the load. Following are the points to remember −
● Integrity checks need to be limited because they require heavy processing power.
● Integrity checks should be applied on the source system to avoid performance degradation of the data load.
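Applying the checks on the source side can be as simple as filtering extracted rows before they ever reach the load. A minimal sketch, with illustrative field names and rules:

```python
# Push integrity checks to the extract side so the warehouse load itself
# can skip them. Field names and rules below are hypothetical examples.
def validate_row(row: dict) -> bool:
    """Apply lightweight integrity rules before the row reaches the load."""
    return (
        isinstance(row.get("customer_id"), int) and row["customer_id"] > 0
        and isinstance(row.get("amount"), (int, float)) and row["amount"] >= 0
    )

def prevalidate(rows):
    """Split extracted rows into loadable rows and rejects for review."""
    good, bad = [], []
    for row in rows:
        (good if validate_row(row) else bad).append(row)
    return good, bad
```

Because the load then receives only pre-validated rows, the warehouse can disable its own constraint checking for the bulk insert, which is exactly the performance point made above.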
Data Load Tuning
Data load is a critical part of overnight processing. Nothing else can run until data load is complete. This is the entry point
into the system.
There are various approaches to tuning the data load, which are discussed below −
● The most common approach is to insert data using the SQL layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, code runs to check whether there is enough space to insert the data; if sufficient space is not available, more space may have to be allocated to these tables. These checks take time to perform and are costly in CPU.
● The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks, which are later written to the database. This is faster than the first approach, but it can work only with whole blocks of data, which can lead to some space wastage.
● The third approach is that, while loading data into a table that already contains data, we can maintain the indexes.
● The fourth approach is that, to load data into tables that already contain data, we drop the indexes and recreate them when the data load is complete. The choice between the third and fourth approaches depends on how much data is already loaded and how many indexes need to be rebuilt.
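The fourth approach can be sketched concretely: drop the indexes, bulk-insert, then rebuild. A minimal sketch using SQLite (the same pattern applies to any RDBMS that supports `DROP INDEX`/`CREATE INDEX`; table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_amount ON sales (amount)")

def bulk_load(conn, rows):
    """Fourth approach: drop the index, load in bulk, then rebuild it."""
    conn.execute("DROP INDEX IF EXISTS idx_sales_amount")      # no per-row index upkeep
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)  # fast bulk insert
    conn.execute("CREATE INDEX idx_sales_amount ON sales (amount)")  # one rebuild
    conn.commit()

bulk_load(conn, [(i, float(i)) for i in range(1000)])
```

Whether this beats maintaining the indexes during the load (the third approach) depends, as noted above, on how large the existing table is relative to the new data: rebuilding an index over a huge table to absorb a small load is usually a net loss.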
Data Warehousing - Testing
Testing is very important for data warehouse systems to make them work correctly and efficiently. There are three basic
levels of testing performed on a data warehouse −
● Unit testing
● Integration testing
● System testing
Unit Testing
● In unit testing, each component is separately tested.
● Each module, i.e., each procedure, program, SQL script, or Unix shell script, is tested.
● This test is performed by the developer.
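A unit test for a warehouse component typically exercises one transformation in isolation. A minimal sketch using Python's `unittest`, with a hypothetical transformation function standing in for a real module:

```python
import unittest

def normalize_region(code: str) -> str:
    """Illustrative transformation under test: map raw region codes."""
    mapping = {"N": "North", "S": "South", "E": "East", "W": "West"}
    return mapping.get(code.strip().upper(), "Unknown")

class TestNormalizeRegion(unittest.TestCase):
    def test_known_code_is_normalized(self):
        self.assertEqual(normalize_region(" n "), "North")

    def test_unknown_code_falls_back(self):
        self.assertEqual(normalize_region("X"), "Unknown")

# Running `unittest.main()` in a script would execute both tests.
```

The same idea applies to SQL scripts and shell modules: the developer feeds each one known inputs and asserts on its outputs before it is combined with other components.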
Integration Testing
● In integration testing, the various modules of the application are brought together and then tested against the
number of inputs.
● It is performed to test whether the various components work well together after integration.
System Testing
● In system testing, the whole data warehouse application is tested together.
● The purpose of system testing is to check whether the entire system works correctly together or not.
● System testing is performed by the testing team.
● Since the size of the whole data warehouse is very large, it is usually possible to perform only minimal system testing before the test plan can be enacted.