
Data Warehousing & Data Mining (Unit 5)

 Data Visualization

Data visualization is the graphical representation of quantitative information and data using visual elements like graphs, charts, and maps. Data visualization converts large and small data sets into visuals that are easy for humans to understand and process. Data visualization tools provide accessible ways to understand outliers, patterns, and trends in the data.

In the world of Big Data, data visualization tools and technologies are required to analyze vast amounts of information. Data visualizations are common in everyday life, and they most often appear in the form of graphs and charts. A combination of multiple visualizations and bits of information is referred to as an infographic.

Data visualizations are used to discover unknown facts and trends. You can see visualizations in the form of line
charts to display change over time. Bar and column charts are useful for observing relationships and making
comparisons. A pie chart is a great way to show parts-of-a-whole. And maps are the best way to share
geographical data visually.
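As a small illustration of these chart types, the following sketch (not part of the original notes; it assumes the matplotlib library is installed and uses invented figures) draws a line chart for change over time, a bar chart for comparisons, and a pie chart for parts-of-a-whole:

# Minimal sketch (assumes matplotlib is installed) showing the chart types
# described above: a line chart for change over time, a bar chart for
# comparisons, and a pie chart for parts-of-a-whole. The data is invented
# purely for illustration.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]          # change over time -> line chart
regions = ["North", "South", "East"]
revenue = [400, 310, 280]             # comparison -> bar chart
shares = [55, 30, 15]                 # parts-of-a-whole -> pie chart

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.plot(months, sales, marker="o")
ax1.set_title("Sales over time (line)")
ax2.bar(regions, revenue)
ax2.set_title("Revenue by region (bar)")
ax3.pie(shares, labels=["Product A", "Product B", "Product C"], autopct="%1.0f%%")
ax3.set_title("Market share (pie)")
plt.tight_layout()
plt.show()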

Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic maps, heat maps, pie charts, and fever charts.

 What makes Data Visualization Effective?

Effective data visualizations are created where communication, data science, and design collide. Data visualizations done right distil key insights from complicated data sets into something meaningful and natural.

American statistician and Yale professor Edward Tufte believes useful data visualizations consist of "complex ideas communicated with clarity, precision, and efficiency."
To craft an effective data visualization, you need to start with clean data that is well-sourced and complete. Once the data is ready to visualize, you need to pick the right chart. After you have decided on the chart type, you need to design and customize your visualization to your liking. Simplicity is essential - you don't want to add any elements that distract from the data.

 History of Data Visualization

The concept of using pictures to understand data was launched in the 17th century with maps and graphs, and then in the early 1800s the pie chart was introduced.

Several decades later, one of the most advanced examples of statistical graphics occurred when Charles Minard
mapped Napoleon's invasion of Russia. The map represents the size of the army and the path of Napoleon's retreat
from Moscow - and that information tied to temperature and time scales for a more in-depth understanding of the
event.

Computers made it possible to process large amounts of data at lightning-fast speeds. Nowadays, data visualization has become a fast-evolving blend of art and science that is certain to change the corporate landscape over the next few years.

 Importance of Data Visualization

Data visualization is important because of the way the human brain processes information. Using graphs and charts to visualize large amounts of complex data is easier than poring over spreadsheets and reports.

Data visualization is an easy and quick way to convey concepts universally. You can experiment with different scenarios by making slight adjustments.

 Data visualization has some more specialties, such as:

 Data visualization can identify areas that need improvement or modifications.


 Data visualization can clarify which factors influence customer behavior.
 Data visualization helps you to understand which products to place where.
 Data visualization can predict sales volumes.
Data visualization tools have been essential for democratizing data and analytics and for making data-driven insights available to workers throughout an organization. They are easier to operate than earlier versions of BI software or traditional statistical analysis software. This has led to a rise in lines of business implementing data visualization tools on their own, without support from IT.

 Why Use Data Visualization?

1. To make information easier to understand and remember.
2. To discover unknown facts, outliers, and trends.
3. To visualize relationships and patterns quickly.
4. To ask better questions and make better decisions.
5. To perform competitive analysis.
6. To improve insights.

 Aggregation Query Facility

Aggregation in data mining is the process of finding, collecting, and presenting data in a summarized format to perform statistical analysis of business schemes or of human patterns. When large amounts of data are collected from various datasets, it is crucial to gather accurate data to produce significant results. Data aggregation can help in taking prudent decisions in marketing, finance, product pricing, etc. Groups of detailed records are replaced with statistical summaries. Aggregated data held in the data warehouse helps solve analytical problems, which in turn reduces the time spent resolving queries against the data sets.

 How does Data aggregation work:

Data aggregation is needed when a dataset as a whole provides little useful information and cannot be used directly for analysis. The datasets are therefore summarized into useful aggregates to obtain the desired results and to enhance the user experience or the application itself. Aggregates provide measurements such as sum, count, and average. Summarized data helps in the demographic study of customers and their behavior patterns. Aggregated data helps in finding useful information about a group once it is written up in reports. It also helps in data lineage, i.e. understanding, recording and visualizing data, which in turn helps in tracing the root cause of errors in data analytics. An aggregated element does not have to be numeric; we can also count non-numeric data. Aggregation must be done over a group of data, not on individual records.
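The following minimal sketch (not from the original notes; standard-library Python with invented purchase records) illustrates this idea: individual records are replaced by group-level count, sum and average aggregates, such as the average customer age per product:

# Minimal sketch (standard library only) of data aggregation: individual
# purchase records are replaced by group-level summaries such as count,
# sum and average -- here, the average customer age per product.
# The records are invented for illustration.
from collections import defaultdict
from statistics import mean

purchases = [
    {"product": "laptop", "customer_age": 34, "amount": 900},
    {"product": "laptop", "customer_age": 41, "amount": 1100},
    {"product": "phone",  "customer_age": 23, "amount": 600},
    {"product": "phone",  "customer_age": 27, "amount": 650},
]

groups = defaultdict(list)
for row in purchases:
    groups[row["product"]].append(row)

for product, rows in groups.items():
    summary = {
        "count": len(rows),                                  # count aggregate
        "total_sales": sum(r["amount"] for r in rows),       # sum aggregate
        "avg_age": mean(r["customer_age"] for r in rows),    # average aggregate
    }
    print(product, summary)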

 Examples of aggregate data:

 Finding the average age of customers buying a particular product, which can help in finding the targeted age group for that product. Instead of dealing with individual customers, the average age of the customers is calculated.
 Finding the number of consumers by country. This can increase sales in countries with more buyers and help the company enhance its marketing in countries with fewer buyers. Here also, instead of individual buyers, the group of buyers in a country is considered.
 By collecting data from online buyers, the company can analyse consumer behaviour patterns and the success of a product, which helps the marketing and finance departments find new marketing strategies and plan the budget.
 Finding voter turnout in a state or country. This is done by counting the total votes for a candidate in a particular region instead of examining individual voter records.
 Data aggregators:

A data aggregator is a system in data mining that collects data from numerous sources, then processes the data and repackages it into useful data packages. Data aggregators play a major role in improving customer data by acting as agents. They help in the query and delivery process, where a customer requests data instances about a certain product; the aggregator provides the customer with matched records of the product, and the customer can then buy any of the matched records.

 Working of Data aggregators:


The working of data aggregators takes place in three steps:

 Collection of data: collecting data from different datasets in an enormous database. The data can be extracted using IoT (Internet of Things) sources such as
o communications on social media
o speech recognition, e.g. from call centres
o news headlines
o browsing history and other personal data from devices.
 Processing of data: after collecting the data, the data aggregator finds the atomic data and aggregates it. In the processing step, aggregators use various algorithms from the fields of Artificial Intelligence and Machine Learning, and also incorporate statistical methods such as predictive analysis. In this way, various useful insights can be extracted from the raw data.
 Presentation of data: after the processing step, the data will be in a summarized format which can provide the desired statistical results with detailed and accurate data.

 Choice of manual or automated data aggregators:

Data aggregation can also be done manually. When one starts a new company, one can opt for manual aggregation by using Excel sheets and by creating charts to manage performance, budget, marketing, etc.

Data aggregation in a well-established company calls for middleware, third-party software that aggregates the data automatically using marketing tools. When very large datasets are encountered, however, a data aggregator system is needed to provide accurate results.

 Types of Data Aggregation:


 Time aggregation: provides the data points for a single resource over a defined time period.
 Spatial aggregation: provides the data points for a group of resources over a defined time period.

 Time intervals for data aggregation process:

 Reporting period: the period over which the data is collected for presentation. The data can either be aggregated data points or simply raw data. E.g. if data from a network device is collected and processed into a summarized format over one day, the reporting period is one day.
 Granularity: the period over which data points are collected for aggregation. E.g. to find the sum of data points for a specific resource collected over a period of 10 minutes, the granularity is 10 minutes. The value of granularity can vary from a minute to a month depending upon the reporting period.
 Polling period: the frequency at which resources are sampled for data. E.g. if a group of resources is polled every 7 minutes, data points for each resource are generated every 7 minutes. Polling period and granularity come under spatial aggregation.
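As a hedged illustration of granularity, the sketch below (standard-library Python, invented sample data) sums raw data points polled from a single resource into 10-minute buckets; a day's worth of such buckets would then form one reporting period:

# Minimal sketch (standard library only, invented sample data) of time
# aggregation: raw data points polled from a single resource are summed
# into buckets whose width is the granularity (10 minutes here). A day of
# such buckets would then make up one reporting period.
from collections import defaultdict
from datetime import datetime

GRANULARITY_MIN = 10

samples = [  # (timestamp, bytes transferred) polled from one device
    (datetime(2024, 1, 1, 9, 2), 120),
    (datetime(2024, 1, 1, 9, 7), 80),
    (datetime(2024, 1, 1, 9, 13), 200),
    (datetime(2024, 1, 1, 9, 21), 50),
]

buckets = defaultdict(int)
for ts, value in samples:
    bucket_start = ts.replace(minute=(ts.minute // GRANULARITY_MIN) * GRANULARITY_MIN,
                              second=0, microsecond=0)
    buckets[bucket_start] += value        # sum is the aggregate measurement

for start in sorted(buckets):
    print(start.strftime("%H:%M"), buckets[start])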

 Applications of Data Aggregation:

 Data aggregation is used in many fields where large numbers of datasets are involved. It helps in making fruitful decisions in marketing or finance management, and in the planning and pricing of products.
 Efficient use of data aggregation can help in the creation of marketing schemes. E.g. if a company is running ad campaigns on a particular platform, it must deeply analyse the data to raise sales. Aggregation can help in analysing the execution of campaigns over a given time period, for a particular cohort, or on a particular channel/platform. This can be done in three steps, namely Extract, Transform, Visualize.
Workflow of Data Analysis in SaaS Applications.

 Data aggregation plays a major role in the retail and e-commerce industries by monitoring competitive prices. In this field, keeping track of fellow companies is a must: a company should collect details of pricing, offers, etc. of other companies to know what its competitors are up to. This can be done by aggregating data from a single resource such as a competitor's website.
 Data aggregation plays an impactful role in the travel industry. It comprises research about competitors, gaining marketing intelligence to reach people, and image capture from their travel websites. It also includes customer sentiment analysis, which helps find emotions and satisfaction based on linguistic analysis. Failed data aggregation in this field can lead to declining growth of the travel company.
 For business analysis purposes, the data can be aggregated into summary formats which help the head of the firm take correct decisions for satisfying the customers. It also helps in inspecting groups of people.

 Data Aggregation with Web Data Integration (WDI):

Web Data Integration (WDI) addresses the time-consuming nature of aggregation in the data mining field, where data from different websites is aggregated into a single workflow. By using WDI, the time taken to aggregate data can be brought down to minutes, which increases accuracy and thereby prevents human-made errors. By following the use cases provided by varied fields, a company can extract data from other sites to increase efficiency and accuracy. It can be done whenever the company wants and wherever it is needed. The in-built quality control in WDI helps enhance accuracy. It not only aggregates but also cleans the data and prepares it in useful forms for integration or analysis. If a company wants accuracy in dealing with data, WDI is the inevitable choice.

 OLAP Servers

Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in order to make
business decisions. OLAP provides a platform for gaining insights from databases retrieved from multiple database
systems at the same time. It is based on a multidimensional data model, which enables users to extract and view
data from various perspectives. A multidimensional database is used to store OLAP data. Many Business Intelligence (BI) applications rely on OLAP technology.

 Type of OLAP servers:

The three major types of OLAP servers are as follows:


 ROLAP
 MOLAP
 HOLAP

Relational OLAP (ROLAP):

Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational database, where
both the base data and dimension tables are stored as relational tables. ROLAP servers are used to bridge the gap
between the relational back-end server and the client’s front-end tools. ROLAP servers store and manage warehouse
data using RDBMS, and OLAP middleware fills in the gaps.
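A minimal ROLAP-style sketch is shown below (it uses Python's standard sqlite3 module and an invented star-schema fragment, purely for illustration): the fact and dimension data live in ordinary relational tables, and a summary is produced with a SQL GROUP BY query:

# Minimal ROLAP-style sketch (standard library sqlite3, invented schema):
# the base fact data and a dimension table are ordinary relational tables,
# and an OLAP-style summary is produced with a SQL GROUP BY query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, region TEXT, amount REAL);
    INSERT INTO dim_product VALUES (1, 'electronics'), (2, 'grocery');
    INSERT INTO fact_sales  VALUES (1, 'north', 100), (1, 'south', 150),
                                   (2, 'north', 40),  (2, 'south', 60);
""")

# Aggregate the relational fact table along the product-category dimension.
rows = conn.execute("""
    SELECT d.category, f.region, SUM(f.amount) AS total
    FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category, f.region
""").fetchall()
for row in rows:
    print(row)
conn.close()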

Benefits:

 It is compatible with data warehouses and OLTP systems.


 The data size limitation of ROLAP technology is determined by the underlying RDBMS. As a result, ROLAP does
not limit the amount of data that can be stored.

Limitations:

 SQL functionality is constrained.


 It’s difficult to keep aggregate tables up to date.

 Multidimensional OLAP (MOLAP):

Through array-based multidimensional storage engines, Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views of data. Storage utilization in multidimensional data stores may be low if the data set is sparse.

MOLAP stores data on discs in the form of a specialized multidimensional array structure. It is used for OLAP,
which is based on the arrays’ random-access capability. Dimension instances determine array elements, and the data
or measured value associated with each cell is typically stored in the corresponding array element. The
multidimensional array is typically stored in MOLAP in a linear allocation based on nested traversal of the axes in
some predetermined order.

However, unlike ROLAP, which stores only records with non-zero facts, all array elements are defined in MOLAP,
and as a result, the arrays tend to be sparse, with empty elements occupying a larger portion of them. MOLAP
systems typically include provisions such as advanced indexing and hashing to locate data while performing queries
for handling sparse arrays, because both storage and retrieval costs are important when evaluating online
performance. MOLAP cubes are ideal for slicing and dicing data and can perform complex calculations. When the
cube is created, all calculations are pre-generated.
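The sketch below (assuming the numpy library is available; the dimensions and figures are invented) illustrates the MOLAP idea of holding measures in a dense multidimensional array so that slicing and aggregation become simple array operations:

# Minimal MOLAP-style sketch (assumes numpy is available, data invented):
# measures are held in a dense multidimensional array whose axes are the
# dimensions, so slicing, dicing and pre-computed aggregates are simple
# array operations.
import numpy as np

products = ["laptop", "phone"]          # dimension 0
regions  = ["north", "south", "east"]   # dimension 1
months   = ["Jan", "Feb"]               # dimension 2

# Sales cube indexed as cube[product, region, month]; empty cells are 0.
cube = np.zeros((len(products), len(regions), len(months)))
cube[0, 0, 0] = 100   # laptop / north / Jan
cube[0, 1, 1] = 150   # laptop / south / Feb
cube[1, 2, 0] = 60    # phone  / east  / Jan

# "Slice": fix one dimension (all sales for Jan).
jan_slice = cube[:, :, months.index("Jan")]

# Pre-computed aggregate: total sales per product across all regions/months.
per_product_totals = cube.sum(axis=(1, 2))

print(jan_slice)
print(dict(zip(products, per_product_totals)))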

Benefits:

 Suitable for slicing and dicing operations.


 Outperforms ROLAP when data is dense.
 Capable of performing complex calculations.

Limitations:

 It is difficult to change the dimensions without re-aggregating.


 Since all calculations are performed when the cube is built, a large amount of data cannot be stored in the cube itself.

 Hybrid OLAP (HOLAP):

ROLAP and MOLAP are combined in Hybrid On-Line Analytical Processing (HOLAP), which offers the greater scalability of ROLAP and the faster computation of MOLAP. HOLAP servers are capable of storing large amounts of detailed data. On the one hand, HOLAP benefits from ROLAP's greater scalability; on the other, it makes use of cube technology for faster performance and summary-type information. Because detailed data is stored in a relational database, the cubes are smaller than in MOLAP.

Benefits:

 HOLAP combines the benefits of MOLAP and ROLAP.


 Provide quick access at all aggregation levels.

Limitations

 Because it supports both MOLAP and ROLAP servers, HOLAP architecture is extremely complex.
 There is a greater likelihood of overlap, particularly in their functionalities.
 Other types of OLAP include:

 Web OLAP (WOLAP): WOLAP refers to an OLAP application that can be accessed through a web browser.
WOLAP, in contrast to traditional client/server OLAP applications, is thought to have a three-tiered architecture
consisting of three components: a client, middleware, and a database server.
 Desktop OLAP (DOLAP): DOLAP stands for desktop online analytical processing; the user downloads data from the source and works with it on their desktop or laptop. In comparison to other OLAP applications, functionality is limited, but it is less expensive.
 Mobile OLAP (MOLAP): mobile OLAP refers to OLAP functionality delivered to wireless or mobile devices, with the user working on and accessing the data via those devices.
 Spatial OLAP (SOLAP): SOLAP combines the capabilities of Geographic Information Systems (GIS) and OLAP into a single user interface. SOLAP was created because the data can be alphanumeric, image, or vector. This allows for the quick and easy exploration of data stored in a spatial database.

 Difference between ROLAP, MOLAP and HOLAP

1. Relational Online Analytical Processing (ROLAP) :

ROLAP servers are placed between the relational back-end server and client front-end tools. ROLAP uses a relational or extended DBMS to store and manage warehouse data. ROLAP has 3 main components: Database Server, ROLAP server, and Front-end tool.

Advantages of ROLAP –

 ROLAP can handle large amounts of data.


 ROLAP tools don’t use pre-calculated data cubes.
 Data can be stored efficiently.
 ROLAP can leverage functionalities inherent in the relational database.

Disadvantages of ROLAP –

 Performance of ROLAP can be slow.


 In ROLAP, it is difficult to maintain aggregate tables.
 Limited by SQL functionalities.
2. Multidimensional Online Analytical Processing (MOLAP) :

MOLAP does not use a relational database for storage; it stores data in optimized multidimensional array storage. Storage utilization may be low with multidimensional data stores. Many MOLAP servers handle dense and sparse data sets by using two levels of data storage representation. MOLAP has 3 components: Database Server, MOLAP server, and Front-end tool.

Advantages of MOLAP –

 MOLAP is basically used for complex calculations.


 MOLAP is optimal for operations such as slice and dice.
 MOLAP allows the fastest indexing to the pre-computed summarized data.

Disadvantages of MOLAP –

 MOLAP can’t handle large amounts of data.
 MOLAP requires additional investment.
 Without re-aggregation, it is difficult to change dimensions.

3. Hybrid Online Analytical Processing (HOLAP) :

Hybrid is a combination of both ROLAP and MOLAP. It offers the functionalities of both ROLAP and MOLAP, such as the faster computation of MOLAP and the higher scalability of ROLAP. The aggregations are stored separately in a MOLAP store. Its server allows storing large volumes of detailed information.

Advantages of HOLAP –

 HOLAP provides the functionalities of both MOLAP and ROLAP.


 HOLAP provides fast access at all levels of aggregation.

Disadvantages of HOLAP –

 HOLAP architecture is very complex to understand because it supports both MOLAP and ROLAP.

 Difference between ROLAP, MOLAP and HOLAP:

Basis of comparison across ROLAP, MOLAP and HOLAP:

 Storage location for summary aggregation: ROLAP uses a relational database; MOLAP uses a multidimensional database; HOLAP uses a multidimensional database.
 Processing time: very slow in ROLAP; fast in MOLAP; fast in HOLAP.
 Storage space requirement: large in ROLAP compared with MOLAP and HOLAP; medium in MOLAP compared with ROLAP and HOLAP; small in HOLAP compared with MOLAP and ROLAP.
 Storage location for detail data: ROLAP uses a relational database; MOLAP uses a multidimensional database; HOLAP uses a relational database.
 Latency: low in ROLAP compared with MOLAP and HOLAP; high in MOLAP compared with ROLAP and HOLAP; medium in HOLAP compared with MOLAP and ROLAP.
 Query response time: slow in ROLAP compared with MOLAP and HOLAP; fast in MOLAP compared with ROLAP and HOLAP; medium in HOLAP compared with MOLAP and ROLAP.

 Data Mining Interface Security Backup and Recovery


SECURITY

The objective of a data warehouse is to make large amounts of data easily accessible to
the users, hence allowing the users to extract information about the business as a whole. But we
know that there could be some security restrictions applied on the data that can be an obstacle
for accessing the information. If the analyst has a restricted view of data, then it is impossible to
capture a complete picture of the trends within the business.

The data from each analyst can be summarized and passed on to management, where the different summaries can be aggregated. As the aggregation of summaries cannot be the same as the aggregation of the data as a whole, it is possible to miss some information trends in the data unless someone is analyzing the data as a whole.

Security Requirements

Adding security features affects the performance of the data warehouse, so it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live.

During the design phase of the data warehouse, we should keep in mind what data sources may
be added later and what would be the impact of adding those data sources. We should consider
the following possibilities during the design phase.
 Whether the new data sources will require new security and/or audit restrictions to be implemented.

 Whether new users will be added who have restricted access to data that is already generally available.

This situation arises when the future users and the data sources are not well known. In such a situation, we need to use our knowledge of the business and the objectives of the data warehouse to determine the likely requirements.

The following activities get affected by security measures −

 User access
 Data load
 Data movement
 Query generation

User Access

We need to first classify the data and then classify the users on the basis of the data they
can access. In other words, the users are classified according to the data they can access.

Data Classification

The following two approaches can be used to classify the data −

 Data can be classified according to its sensitivity. Highly sensitive data is classified as highly restricted and less sensitive data is classified as less restricted.
 Data can also be classified according to the job function. This restriction allows only
specific users to view particular data. Here we restrict the users to view only that part
of the data in which they are interested and are responsible for.
There are some issues in the second approach. To understand, let's have an example. Suppose
you are building the data warehouse for a bank. Consider that the data being stored in the data
warehouse is the transaction data for all the accounts. The question here is, who is allowed to
see the transaction data. The solution lies in classifying the data according to the function.
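As a hypothetical illustration of classification by job function (using Python's standard sqlite3 module and an invented schema, not a prescribed design), each group of users can be given a view over only the part of the transaction data it is responsible for:

# Hypothetical sketch (standard library sqlite3, invented schema) of
# classifying data by job function: each group of users is only given a
# view over the part of the transaction data they are responsible for.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (account_id INTEGER, branch TEXT, amount REAL);
    INSERT INTO transactions VALUES (1, 'downtown', 500.0),
                                    (2, 'uptown',   75.0),
                                    (3, 'downtown', -20.0);

    -- The downtown staff's "function" only covers downtown accounts.
    CREATE VIEW downtown_transactions AS
        SELECT account_id, amount FROM transactions WHERE branch = 'downtown';
""")

# Downtown users query the view, never the underlying table.
print(conn.execute("SELECT * FROM downtown_transactions").fetchall())
conn.close()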

User classification

The following approaches can be used to classify the users −

 Users can be classified as per the hierarchy of users in an organization, i.e., users can
be classified by departments, sections, groups, and so on.

 Users can also be classified according to their role, with people grouped across
departments based on their role.

Classification on basis of Department

Let's take an example of a data warehouse where the users are from the sales and marketing departments. We can have security by a top-down company view, with access centered on the different departments. But there could be some restrictions on users at different levels. This structure is shown in the following diagram.
But if each department accesses different data, then we should design the security access for
each department separately. This can be achieved by departmental data marts. Since these data marts
are separated from the data warehouse, we can enforce separate security restrictions on each data
mart. This approach is shown in the following figure.

Classification Based on Role

If the data is generally available to all the departments, then it is useful to follow the role access
hierarchy. In other words, if the data is generally accessed by all the departments, then apply
security restrictions as per the role of the user. The role access hierarchy is shown in the following
figure.
Audit Requirements

Auditing is a subset of security and a costly activity. Auditing can cause heavy overheads on the system. To complete an audit in time, we require more hardware; therefore, it is recommended that, wherever possible, auditing should be switched off. Audit requirements can be categorized as follows −

 Connections

 Disconnections

 Data access

 Data change

Note − For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From a security perspective, the auditing of failures is very important, because failures can highlight unauthorized or fraudulent access.

Network Requirements

Network security is as important as other aspects of security. We cannot ignore the network security requirement. We need to consider the following issues −

 Is it necessary to encrypt data before transferring it to the data warehouse?

 Are there restrictions on which network routes the data can take?

These restrictions need to be considered carefully. Following are the points to remember −

 The process of encryption and decryption will increase overheads. It would require more
processing power and processing time.

 The cost of encryption can be high if the system is already heavily loaded, because the cost of encryption is borne by the source system.

Data Movement

There exist potential security implications while moving the data. Suppose we need to transfer
some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse,
the following questions are raised −

 Where is the flat file stored?

 Who has access to that disk space?


If we talk about the backup of these flat files, the following questions are raised −

 Do you backup encrypted or decrypted versions?

 Do these backups need to be made to special tapes that are stored separately?

 Who has access to these tapes?

Some other forms of data movement like query result sets also need to be considered. The
questions raised while creating the temporary table are as follows −

 Where is that temporary table to be held?

 How do you make such a table visible?

We should avoid the accidental flouting of security restrictions. If a user with access to restricted data can generate accessible temporary tables, the data can become visible to non-authorized users. We can overcome this problem by having a separate temporary area for users with access to restricted data.

Documentation

The audit and security requirements need to be properly documented. This will be treated as part of the justification. This document can contain all the information gathered from −

 Data classification

 User classification

 Network requirements

 Data movement and storage requirements

 All auditable actions

Impact of Security on Design

Security affects the application code and the development timescales. Security affects the
following area −

 Application development

 Database design

 Testing
Application Development

Security affects the overall application development and it also affects the design of important components of the data warehouse such as the load manager, warehouse manager, and query manager. The load manager may require checking code to filter records and place them in different locations. More transformation rules may also be required to hide certain data. There may also be a requirement for extra metadata to handle any extra objects.

To create and maintain extra views, the warehouse manager may require extra code to enforce security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled into moving data into a location where it should not be available. The query manager requires changes to handle any access restrictions, and will need to be aware of all extra views and aggregations.

Database design

The database layout is also affected because when security measures are implemented,
there is an increase in the number of views and tables. Adding security increases the size of the
database and hence increases the complexity of the database design and management. It will also
add complexity to the backup management and recovery plan.

Testing

Testing the data warehouse is a complex and lengthy process. Adding security to the data
warehouse also affects the testing time complexity. It affects the testing in the following two
ways −

 It will increase the time required for integration and system testing.

 There is added functionality to be tested which will increase the size of the testing suite.

 BACKUP & RECOVERY

A data warehouse is a complex system and it contains a huge volume of data. Therefore it is important to back up all the data so that it is available for recovery in the future as required. In this section, we will discuss the issues in designing the backup strategy.

Backup Terminologies

Before proceeding further, you should know some of the backup terminologies discussed below.

 Complete backup − It backs up the entire database at the same time. This backup includes
all the database files, control files, and journal files.
 Partial backup − As the name suggests, it does not create a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-to-day basis, so that the whole database is backed up effectively once a week.

 Cold backup − Cold backup is taken while the database is completely shut down. In
multi-instance environment, all the instances should be shut down.

 Hot backup − Hot backup is taken when the database engine is up and running. The requirements of hot backup vary from RDBMS to RDBMS.

 Online backup − It is quite similar to hot backup.

Hardware Backup

It is important to decide which hardware to use for the backup. The speed of processing the backup and restore depends on the hardware being used, how the hardware is connected, the bandwidth of the network, the backup software, and the speed of the server's I/O system. Here we will discuss some of the hardware choices that are available and their pros and cons. These choices are as follows −

 Tape Technology

 Disk Backups

Tape Technology

The tape choice can be categorized as follows −

 Tape media
 Standalone tape drives
 Tape stackers
 Tape silos
Tape Media

There exist several varieties of tape media. Some tape media standards are listed in the table below −

Tape Media Capacity I/O rates

DLT 40 GB 3 MB/s

3490e 1.6 GB 3 MB/s

8 mm 14 GB 1 MB/s

Other factors that need to be considered are as follows −

 Reliability of the tape medium
 Cost of tape medium per unit
 Scalability
 Cost of upgrades to the tape system
 Shelf life of tape medium

Standalone Tape Drives

The tape drives can be connected in the following ways −

 Direct to the server

 As network available devices

 Remotely to other machine

There could be issues in connecting the tape drives to a data warehouse.

 Consider that the server is a 48-node MPP machine. We do not know which node to connect the tape drive to, nor how to spread the drives over the server nodes to get optimal performance with the least disruption of the server and the least internal I/O latency.
 Connecting the tape drive as a network-available device requires the network to be up to the job of the huge data transfer rates. Make sure that sufficient bandwidth is available during the time you require it.

 Connecting the tape drives remotely also requires high bandwidth.

Tape Stackers

A tape stacker is a device that loads multiple tapes into a single tape drive. The stacker dismounts the current tape when it has finished with it and loads the next tape, so only one tape is accessible at a time. Prices and capabilities vary, but the common ability is that they can perform unattended backups.

Tape Silos

Tape silos provide large storage capacities. Tape silos can store and manage thousands of
tapes. They can integrate multiple tape drives. They have the software and hardware to label and
store the tapes they store. It is very common for the silo to be connected remotely over a network
or a dedicated link. We should ensure that the bandwidth of the connection is up to the job.

Disk Backups

Methods of disk backups are −

 Disk-to-disk backups

 Mirror breaking

These methods are used in the OLTP system. These methods minimize the database
downtime and maximize the availability.

Disk-to-Disk Backups

Here the backup is taken on disk rather than on tape. Disk-to-disk backups are done for the following reasons −

 Speed of initial backups

 Speed of restore

Backing up the data from disk to disk is much faster than to tape. However, it is usually an intermediate step; the data is later backed up to tape. The other advantage of disk-to-disk backups is that they give you an online copy of the latest backup.
Mirror Breaking

The idea is to have disks mirrored for resilience during the working day. When backup is
required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk
backups.

Note − The database may need to be shutdown to guarantee consistency of the backup.

Optical Jukeboxes

Optical jukeboxes allow data to be stored near-line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or a tape silo. The drawback of this technique is that it has slower write speeds than disks. But optical media provide long life and reliability, which makes them a good choice of medium for archiving.

Software Backups

There are software tools available that help in the backup process. These software tools come as a package. These tools not only take backups; they can also effectively manage and control the backup strategies. There are many software packages available in the market. Some of them are listed in the following table −

Package Name Vendor

Networker Legato

ADSM IBM

Epoch Epoch Systems

Omniback II HP

Alexandria Sequent

Criteria for Choosing Software Packages


The criteria for choosing the best software package are listed below −
 How scalable is the product as tape drives are added?

 Does the package have a client-server option, or must it run on the database server itself?

 Will it work in cluster and MPP environments?

 What degree of parallelism is required?

 What platforms are supported by the package?

 Does the package support easy access to information about tape contents?

 Is the package database aware?

 What tape drive and tape media are supported by the package?

 SERVICE LEVEL AGREEMENT (SLA)

A service level agreement (SLA) is essential to the design process of the data
warehouse. In particular, it is essential to the design of the backup strategy. You need the
SLA as a guide to the rates of backup that are required, and more importantly the rate at
which a restore needs to be accomplished. The SLA affects not just the backup, but also
such fundamental design decisions as partitioning of the fact data.

Definition of Types of System

There are a number of terms that are frequently used when talking about large

systems such as data warehouses:

• Operational

• Mission critical

• 7 x 24

• 7 x 24 x 52

The meaning of these terms should be clear from the names, but the terms are often
loosely used — some would say even misused — so we shall define what we mean by
them.
• An operational system is a system that has responsibilities to the operations of
the business.

• A mission critical system is a system that the business absolutely depends on to


function.
• A 7 x 24 system is a system that needs to be available all day every day,
except for small periods of planned downtime.

• A 7 x 24 x 52 system is a true 7 x 24 system, which is required to be running all the time.

Defining the SLA

These topics can be further divided into two categories:

• User requirements

• System requirements

User requirements are the elements that directly affect the users, such as hours of
access and response times. System requirements are the needs imposed on the system by
the business such as system availability.

User Requirements

One of the key sets of requirements to capture during the analysis phase is the user
requirements. Detailed information is required on the following:

• User online access — hours of work

• User batch access

• User expected response times

• Average response times

• Maximum acceptable response times

System Requirements

The key system requirement that needs to be measured is the maximum acceptable downtime for the system. This question can be asked from the viewpoints both of user discomfort and of business damage. However, the reality is that in systems of this size and complexity you have to work to the business damage limit, not the user discomfort limit; otherwise costs and complexity soar. The SLA needs to stipulate fully any measures related to downtime. In particular, some measure of the required availability of the server is needed.
Availability can be measured in a number of different ways, but however you state
it in the SLA it must be unambiguous. It is normal for availability to be measured as a
required percentage of uptime. In other words:

D = 100 - A

where D = acceptable downtime, and A is the percentage required availability.

Note that D covers both planned and unplanned downtime. This can be further broken down into acceptable online downtime (Don) and acceptable offline downtime (Doff). These can be defined as:

Don = 100 - Aon

Doff = 100 - Aoff

where Aon is the percentage of the N online hours for which the system is required to be available; Aoff is the percentage of the remaining (24 - N) hours for which the system is required to be available; and N is the number of hours in the user/online day.
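The small sketch below (with invented figures) turns these formulas into acceptable downtime hours per day:

# Minimal sketch (invented figures) turning the SLA availability formulas
# above into acceptable downtime hours. N is the length of the user/online
# day; A_on and A_off are the required availability percentages for the
# online and offline windows (Aon and Aoff above).
N = 18          # hours in the user/online day
A_on = 99.5     # required availability during the online window (%)
A_off = 95.0    # required availability during the offline window (%)

D_on = 100 - A_on            # acceptable online downtime (%)
D_off = 100 - A_off          # acceptable offline downtime (%)

online_downtime_hours = N * D_on / 100
offline_downtime_hours = (24 - N) * D_off / 100

print(f"Acceptable online downtime:  {online_downtime_hours:.2f} h/day")
print(f"Acceptable offline downtime: {offline_downtime_hours:.2f} h/day")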

 OPERATING THE DATA WAREHOUSE

The operation of a data warehouse is a non-trivial task. The design of the operational environment is no less challenging, and the design is not made any easier by the need for operations to be as automated as possible.

Day-to-Day Operations of the Data Warehouse

The daily operations of the data warehouse are equally as complex as, but less fraught than, the overnight processing. The main purpose of daytime usage of the machine is to service the users' queries. Some of the other operations that can occur during the day are:

• Monitoring application daemons
• Query management
o Batch
o Ad hoc
• Backup/recovery
• Housekeeping scripts
• Data feeds
o Drip feed of load files
• Data extracts
o Query results
o Data marts
• Performance monitoring
• Event management
• Job scheduling

The warehouse application managers are likely to have a number of service or background processes running. These processes are often called daemons. These daemons are used to log events, act as servers, and so on.
Query management is a vital part of the daily operations. Some of the monitoring and control
can be automated, but there will still be a requirement for a DBA to be available to deal with
any problems.

Backups are usually avoided during the user day, to avoid contention with the users' queries. In addition, in a data warehouse the bulk of the data is read-only, and little happens during working hours that will require backing up.

There are always housekeeping tasks that need to be run. Scripts that purge down directories, log off idle connections and so on are as much a part of a data warehouse as of any other system.

The data feeds may arrive at any time of the day. Having some of the data arrive during
the day is not an issue unless it causes network problems.

Data extracts are subsets of data that are offloaded from the server machine onto other machines. Extracts can be unscheduled user extracts of some query results, or they can be scheduled extracts such as data mart refreshes.

Extraction of user query results can impose a significant load on the network, and
has to be controlled.

Data marts are regular and, as such, more controllable data extracts. As with other extracts, they will incur network costs, unless they are transferred via some medium such as tape.

Performance monitoring, event monitoring and scheduling are all ongoing and highly automated operations. They should require minimal manual input to maintain. Any manual interfacing should be via a menu system or the relevant GUI front end.

Logins for each user will need to be created, maintained, and deleted. This is a task that should be menu driven to prevent accidents. Adding a user with the wrong privileges can prove to be a costly mistake. The users should all fall into easy categories or groups, allowing a template login to be created for each group. This template should have all the default accesses, profiles, roles and so on for that group.

Log files, which log ongoing events, and trace files, which dump information on specific processes, will generate a vast amount of data and can occupy a lot of disk space. Some trace files will have errors or trace information from unsolved problems; others may have performance statistics from some recent test runs. Some trace information is worth keeping in its own right.

Starting up and shutting down the server, database or data warehouse applications are tasks that are likely to be performed infrequently. They are, however, tasks that it is important to get right, because shutting down the machine or database incorrectly can cause problems on restart. Each will need scripted and menu-driven procedures for shutting it down and starting it up.

Printing requirements will need to be established. From experience, printing is not a common data warehouse requirement; the result sets of most queries are too large to print meaningfully.

Problem management is an area that needs to be clearly defined and documented. It is vital that all administration staff know who, where and when to call if problems arise. It is common to have several groups inside an organization responsible for different parts of the data warehouse.

Upgrading a data warehouse is always a problem. As the system becomes a victim of


its own success, it will become harder and harder to find extended periods of time when the
system can be down.

Overnight Processing

There are a number of major issues that need to be addressed if overnight processing is not to become a stumbling block to the success of the data warehouse.

The key issue is keeping to the time window and ensuring that you do not eat into the next business day. The sheer volume of work that has to be accomplished, and the serial nature of many of the operations, makes this more difficult than it may first seem.

The tasks that need to be accomplished overnight are, in order:

1. Data rollup
2. Obtaining the data
3. Data transformation
4. Daily load
5. Data cleanup
6. Index creation
7. Aggregation creation and maintenance
8. Backup
9. Data archiving
10. Data mart refresh

In the data rollup step, older data is rolled up into aggregated form to save space. This operation is not dependent on any of the other operations in the list.

The first real step in the overnight processing is obtaining the data. This in itself is probably
a simple operation; the problems arise when the data is delayed or cannot be transferred.

Data transformation will vary from data warehouse to data warehouse, and it is possible that you will have no transformation at all to perform.
The load step is again a potential serialization point in the overnight processing. The load itself can be parallelized, but the data cleanup cannot proceed until the data is in the database.

The data cleanup processing required will again vary from data warehouse to data
warehouse.

Aggregation creation and maintenance cannot really proceed until all the data has arrived and has been transformed, loaded and cleaned.

The final overnight operation in most systems will be the backup. This has to be last, because some of the most important items that need to be backed up are the new and modified aggregations. The backup can be interlaced with other operations, because the newly loaded data can be backed up as soon as it hits the database.

If data has to be archived, this process will usually be run as part of the backup. Depending on the frequency of the data archiving, this can prove to be a major overhead.

Data marts and other data extracts, if they exist, may have to be refreshed on a regular basis. You need to ascertain exactly when such extracts need to be available, if possible, and how regularly they need to be refreshed.

 Planning, Tuning and Testing

 CAPACITY PLANNING

Process
The capacity plan for a data warehouse is defined within the technical blueprint stage of
the process. The business requirements stage should have identified the approximate sizes for
data, users, and any other issues that constrain system performance.

It is important to have a clear understanding of the usage profiles of all users of the data warehouse. For each user or group of users you need to know the following:

 The number of users in the group;


 Whether they use ad hoc queries frequently
 Whether they use ad hoc queries occasionally at unknown intervals
 Whether they use ad hoc queries occasionally at regular and predictable times
 The average size of query they tend to run;
 The maximum size of query they tend to run;
 The elapsed login time per day;
 The peak time of daily usage;
 The number of queries they run per peak hour;
 The number of queries they run per day.
Estimating the Load

When choosing the hardware for the data warehouse there are many things to consider, such as hardware architecture, resilience, and so on. One point to remember is that the data warehouse will probably grow rapidly from its initial configuration, so it is not sufficient to consider only the initial size of the data warehouse.

Initial Configuration

When sizing the initial configuration you will have no information or statistics to work with, and the sizing will need to be done on the predicted load. Estimating this load is difficult, because there is an ad hoc element to it. However, if the phased approach is taken, the ad hoc element will grow to meet the demand.

How much CPU bandwidth?

To start with you need to consider the distinct loads that will be placed on the system. There are many aspects to this, such as query load, data load, backup and so on, but essentially the load can be divided into two distinct phases:

• daily processing
• overnight processing

These are discussed in Part Three, but essentially they break down as follows:

• daily processing
o user query processing
• overnight processing
o data transformation and load
o aggregation and index creation
o backup

Daily processing

The daily processing is centered around the user queries. To estimate the CPU
requirements you need to estimate the time that each query will take. As much of the query load
will be ad hoc it is impossible to estimate the requirement of every query; therefore another
approach has to be found.

The first thing to do is estimate the size of the largest likely common query. It is possible that some users will want to query across every piece of data in the data warehouse, but this will probably not be a common requirement. It is more likely that the users will want to query the most recent week's or month's worth of data.

To progress any further we need to know the I/O characteristics of the devices that the data will reside on. This allows us to calculate the scan rate S at which the fact data can be read. This will depend on the disk speeds and on the throughput ratings of the controllers. Clearly it also depends on the size of the fact data F itself.

Using S and F you can calculate T, the time in seconds to perform a full table scan of the period in question:

T = F / S

In fact you should calculate a number of times, T1 to Tn, which depend on the degree of parallelism that you are using to perform the scan. Therefore we get

T1 = F / S1
...
Tn = F / Sn

where S1 is the scan speed of a single disk or striped disk set, and Sn is the scan speed of all the disks or disk sets that F is spread across.
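As a worked illustration of this formula (with invented figures for F and the disk scan rate), the sketch below shows how the scan time falls as the degree of parallelism increases:

# Minimal sketch (invented figures) of the scan-time formula T = F / S for
# different degrees of parallelism: F is the size of the fact data to be
# scanned and S_i is the combined scan rate when the data is spread over i
# disks or stripe sets.
F_GB = 200                      # fact data for the period, in GB
single_disk_rate = 10           # scan rate of one disk/stripe set, MB/s

F_MB = F_GB * 1024
for disks in (1, 4, 16):
    S = single_disk_rate * disks        # S_i grows with the parallelism
    T = F_MB / S                        # T_i = F / S_i, in seconds
    print(f"{disks:2d} disk(s): full scan ~ {T / 3600:.1f} hours")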

Overnight processing

The CPU bandwidth required for the data transformation will depend on the amount of data processing that is involved. Unless there is an enormous amount of data transformation, it is unlikely that this operation will require more CPU bandwidth than the aggregation and index creation operation.

If users are allowed to leave queries running overnight, or if queries are allowed to run
for 24 hours or more, you will also need to make a separate allowance for that processing, above
and beyond the calculated CPU requirement for the overnight processing.

How Much Memory?

Memory is a commodity that you can never have enough of. What you need to estimate is the minimum requirement. There are a number of things that affect the amount of memory required.

First, there are the database requirements. The database will need memory to cache data blocks as they are used; it will also need memory to cache parsed SQL statements and so on. These requirements will vary from RDBMS to RDBMS, and you will need to work them out for whatever software you are using.

Secondly, each user connected to the system will use an amount of memory; how much will depend on how they are connected to the system and what software they are running. As it is likely that the users will be connected in a client-server mode, the user memory requirement may be quite small.

Finally, the operating system will require an amount of memory. This will vary with the
operating system and the features and tools you are running. You can get the hardware vendor to
estimate how much memory the system will use.

How much disk?

The largest calculation you will need to perform is the amount of disk space required. This
can be a very tricky thing to calculate, and how successfully you do it will depend on how
successfully the analysis captures the requirements.
The disk requirement can be broken down into the following categories:

• database requirements
o administration
o fact and dimension data
o aggregations
• non-data requirements
o operating system requirements
o other software requirements
o data warehouse requirements
o user requirements

Database sizing

The database will occupy most of the required disk space, and when sizing it you need
to be sure you get it right. There are a number of aspects to the database sizing that need to
be considered.
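One common way to start such a sizing exercise is sketched below; all the row counts, row lengths and overhead factors are invented assumptions, not recommended values:

# Hypothetical sizing sketch (all figures invented): the raw fact data size
# is estimated from row counts and row lengths, then inflated by assumed
# factors for indexes, aggregations and RDBMS overhead to give a rough
# disk requirement for the database.
rows_per_day = 5_000_000
row_length_bytes = 200
days_kept = 365

fact_gb = rows_per_day * row_length_bytes * days_kept / 1024**3
index_gb = fact_gb * 0.5          # assumed: indexes ~50% of fact data
aggregation_gb = fact_gb * 1.0    # assumed: aggregations roughly double it
overhead_gb = (fact_gb + index_gb + aggregation_gb) * 0.2   # assumed 20% overhead

total_gb = fact_gb + index_gb + aggregation_gb + overhead_gb
print(f"Fact data: {fact_gb:.0f} GB, total estimate: {total_gb:.0f} GB")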

 Tuning the Data warehouse

Tuning the Data Load

Because the data load is the entry point into the system, it provides the first opportunity to improve performance. Essentially the flow of data is as depicted in the figure.

As shown in the figure, there are opportunities to transform the data at every stage of this data flow. If all data transformation or integrity checks occur before the data arrives at the data warehouse system, all you need to know is the expected time of arrival. If the checks are performed on the data warehouse system, either before or after the data is loaded, or indeed a combination of both, they will have a direct effect on the capacity and performance of the system. Any tuning of the integrity checks will depend on what form they take.

It is not unknown to apply rudimentary checks to the data being loaded. For example, if the data is telephone call records, you may want to check that each call has a valid customer identifier. If the data is retail information, you may want to check whether the commodity being sold has a valid product identifier. One point to note is that just because some of the loaded data fails such a check, it does not necessarily mean that the data is invalid. For the call records, it could easily be that the data is correct and that the customer information is not up to date. This can happen when a new customer is added as a service subscriber: they will probably use the phone as soon as the service is connected, but it may take several days for the customer information to trickle through to the data warehouse from the customer systems.
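A minimal sketch of such a rudimentary load-time check is shown below (the record layout and identifiers are invented); records that fail the check are set aside for later reprocessing rather than rejected outright:

# Minimal sketch (invented record layout) of the rudimentary load-time check
# described above: call records whose customer identifier is not yet in the
# customer dimension are set aside rather than rejected as invalid, since
# the customer data may simply not have trickled through yet.
known_customers = {"C001", "C002", "C003"}     # from the customer dimension

call_records = [
    {"customer_id": "C001", "duration_s": 120},
    {"customer_id": "C999", "duration_s": 45},   # new subscriber, not yet loaded
]

loadable, suspended = [], []
for record in call_records:
    (loadable if record["customer_id"] in known_customers else suspended).append(record)

print(f"load now: {len(loadable)}, hold for later reprocessing: {len(suspended)}")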

There is little that can be done to tune any business rules enforced by constraints. If the rules are enforced by using SQL or by trigger code, that code needs to be tuned for maximal efficiency.

Loading large quantities of data, as well as being a heavy I/O operation, can be CPU intensive, especially if there are a lot of checks and transformations to be applied to each record. As mentioned above, the load speed can be improved by using direct load techniques. The load can also be improved by using parallelism: multiple processes can be used to speed the load, which helps to spread the CPU load amongst any available CPUs.

If using multiple loads, care is needed to avoid introducing I/O bottlenecks. Ensure
that the source data file does not become a source of contention. Ideally, the source data
should be split into multiple files so that each load process can have its own source file. The
source files should also be spread over multiple disks to avoid contention on a single disk or
I/O controller.
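The sketch below (standard-library Python, toy data) illustrates this approach: the source data is split into one file per load process and a pool of worker processes parses the files concurrently; a real loader would insert the parsed rows into the warehouse instead of just counting them:

# Minimal sketch (standard library only, toy data) of a parallel load: the
# source data is split into one file per load process so that the processes
# do not contend for a single input file, and a pool of workers "loads"
# (here: parses and counts) the files concurrently.
import csv
import tempfile
from multiprocessing import Pool
from pathlib import Path

def load_file(path):
    """Parse one source file; a real loader would insert rows into the warehouse."""
    with open(path, newline="") as f:
        return path, sum(1 for _ in csv.reader(f))

def make_split_files(tmpdir, n_files=4, rows_per_file=1000):
    paths = []
    for i in range(n_files):
        p = Path(tmpdir) / f"source_{i}.csv"
        with open(p, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows([i, r, "data"] for r in range(rows_per_file))
        paths.append(p)
    return paths

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = make_split_files(tmpdir)
        with Pool(processes=4) as pool:            # one worker per source file
            for path, rows in pool.map(load_file, paths):
                print(f"{path.name}: loaded {rows} rows")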

The load destination can also become a bottleneck, particularly if a large number of
load processes are being used. This needs to be considered as part of the original database design.
Each table to be loaded should be spread over multiple spindles. This can be done by striping
the database files across multiple disks, or by spreading the table to be loaded over multiple
database files.

Prioritized Tuning Steps

The following steps provide a recommended method for tuning an Oracle database. These steps are prioritized in order of diminishing returns: steps with the greatest effect on performance appear first.

For optimal results, therefore, resolve tuning issues in the order listed, from the design and development phase through instance tuning.

Step 1: Tune the business rules
Step 2: Tune application design
Step 3: Tune the data design
Step 4: Tune the logical structure of the database
Step 5: Tune database operations
Step 6: Tune the access paths
Step 7: Tune memory allocation
Step 8: Tune I/O and physical structure
Step 9: Tune resource contention
Step 10: Tune the underlying platforms


After completing these steps, reassess your database performance and decide whether
further tuning is necessary.

Tuning is an iterative process. Performance gains made in later steps may pave the
way for further improvements in earlier steps, so additional passes through the tuning process
may be useful.

Tuning Queries

The data warehouse will contain two types of query.

1. Fixed queries

2. Ad hoc queries

Both types of queries need to be tuned, but they require different techniques to deal
with them.

Fixed queries

Tuning the fixed queries is no different from the traditional tuning of a relational
database. They have predictable requirements, and they can be tested to find the best execution
plan. The only real variables with these queries are the size of the data being queried and the
data skew. Even these variables can be dealt with, because the queries can be tested as often as
desired.
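As a stand-in illustration (the notes discuss Oracle, but SQLite is used here so the sketch stays self-contained; the table and index names are invented), the execution plan of a fixed query can be captured and compared from run to run as data volume and skew change.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_product ON sales (product_id)")

# A fixed query with predictable requirements; its plan can be re-checked
# whenever the data volume or data skew changes.
fixed_query = "SELECT SUM(amount) FROM sales WHERE product_id = 42"
for row in conn.execute("EXPLAIN QUERY PLAN " + fixed_query):
    print(row)   # the plan detail shows whether idx_sales_product is used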

Ad hoc queries

The number of users of the data warehouse will have a profound effect on
performance, in particular if they are running ad hoc queries. It is important to have a clear understanding of
the usage profiles of all the users. For each user or group of users you need to know the
following (a small sketch capturing these attributes as a data structure follows the list):


• The number of users in the group

• Whether they use ad hoc queries frequently

• Whether they use ad hoc queries occasionally at unknown intervals

• Whether they use ad hoc queries occasionally at regular predictable times.


• The average size of query they tend to run.

• The maximum size of query they tend to run

• The elapsed login time per day

• The peak time of daily usage

• The number of queries they run per peak hour;

• Whether they require drill-down access to the base data.
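A small Python sketch of a data structure capturing these attributes for one user group is given below; the field names and the example values are purely illustrative.

from dataclasses import dataclass

@dataclass
class UsageProfile:
    group_name: str
    user_count: int
    ad_hoc_frequency: str        # "frequent", "occasional", or "predictable times"
    avg_query_rows: int          # average size of query they tend to run
    max_query_rows: int          # maximum size of query they tend to run
    login_hours_per_day: float   # elapsed login time per day
    peak_hour: int               # peak time of daily usage (hour of day, 0-23)
    queries_per_peak_hour: int
    needs_drill_down: bool       # whether they require drill-down to base data

# Example profile for a hypothetical marketing group.
marketing = UsageProfile("marketing", 25, "occasional", 50_000, 2_000_000,
                         4.0, 10, 12, True)
print(marketing)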

 TESTING THE DATA WAREHOUSE

As with any development project, a data warehouse will need comprehensive testing, only more so: the complexity
and size of a data warehouse system make comprehensive testing both more difficult and more important.
The fact that queries take minutes rather than seconds to run means that small-scale testing is not sufficient; it is necessary
to establish that queries scale with the data volume. The upshot is that comprehensive testing of a data warehouse
takes time, and this makes predicting test schedules difficult.

The Testing Terminologies

The various testing terminologies are:

1. Unit testing: to ensure that each component or module behaves as per the
specified requirements.

2. System testing: to ensure that the interactions between the components or modules
function correctly as per the specified requirements.

3. Integration testing: to test the solution from an end-to-end perspective and ensure that
the system works in a production-like environment.

4. Acceptance testing: to verify that the entire solution meets the business
requirements and successfully supports the business processes from a user's perspective.

5. System assurance testing: to ensure and verify the operational readiness of the
system in a production environment. This is also referred to as the warranty period coverage.

Besides the above-mentioned testing phases, there are a few other popular categories of
testing.

White box testing: knowledge of the program structure and business rules is used to formulate
test cases. Also known as glass box, structural, clear box, and open box testing, it is a testing
technique whereby explicit knowledge of the internal workings of the item being tested (program or software)
is used to select the test data.

Black box testing: appropriate for system testing, where the system is considered as
a black box. The test cases are derived from the requirements. This technique is also known
as functional testing, wherein the internal workings of the item being tested (program or software) are
not known to the tester.

Testing the operational environment

Testing of the data warehouse operational environment is another key set of tests that
will have to be performed. There are a number of aspects that need to be tested:

• Security

• Disk configuration

• Scheduler

• Management tools

• Database management

Testing is very important for data warehouse systems to make them work correctly and
efficiently. There are three basic levels of testing performed on a data warehouse −

 Unit testing
 Integration testing
 System testing

Unit Testing

 In unit testing, each component is separately tested.

 Each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
 This test is performed by the developer.

Integration Testing

 In integration testing, the various modules of the application are brought together and
then tested against a number of inputs.
 It is performed to test whether the various components do well after integration.
System Testing

 In system testing, the whole data warehouse application is tested together.


 The purpose of system testing is to check whether the entire system works correctly
together or not.
 System testing is performed by the testing team.
 Since the size of the whole data warehouse is very large, it is usually only possible to perform
minimal system testing before the test plan can be enacted.

Test Schedule

First of all, the test schedule is created in the process of developing the test plan. In this
schedule, we predict the estimated time required for the testing of the entire data warehouse system.

There are different methodologies available to create a test schedule, but none of them are perfect
because the data warehouse is very complex and large. Also the data warehouse system is
evolving in nature. One may face the following issues while creating a test schedule −

 A simple problem may involve a very large query that takes a day or more to complete,
i.e., the query does not complete in the desired time scale.
 There may be hardware failures such as losing a disk or human errors such as accidentally
deleting a table or overwriting a large table.

Note − Due to the above-mentioned difficulties, it is recommended to always double the


amount of time you would normally allow for testing.

Testing Backup Recovery

Testing the backup recovery strategy is extremely important. Here is the list of
scenarios for which this testing is needed −

 Media failure
 Loss or damage of table space or data file
 Loss or damage of redo log file
 Loss or damage of control file
 Instance failure
 Loss or damage of archive file
 Loss or damage of table
 Failure during data movement

Testing Operational Environment

There are a number of aspects that need to be tested. These aspects are listed below.

 Security − A separate security document is required for security testing. This document
contains a list of disallowed operations, and tests should be devised for each of them.

 Scheduler − Scheduling software is required to control the daily operations of a data


warehouse. It needs to be tested during system testing. The scheduling software requires
an interface with the data warehouse, which will need the scheduler to control overnight
processing and the management of aggregations.

 Disk Configuration − Disk configuration also needs to be tested to identify I/O


bottlenecks. The test should be performed multiple times with different settings.

 Management Tools − It is required to test all the management tools during system
testing. Here is the list of tools that need to be tested.
o Event manager
o System manager
o Database manager
o Configuration manager
o Backup recovery manager

Testing the Database

The database is tested in the following three ways −

 Testing the database manager and monitoring tools − To test the database manager
and the monitoring tools, they should be used in the creation, running, and management
of a test database.

 Testing database features − Here is the list of features that we have to test −

o Querying in parallel

o Create index in parallel

o Data load in parallel


 Testing database performance − Query execution plays a very important
role in data warehouse performance measures. There are sets of fixed
queries that need to be run regularly and they should be tested. To test ad hoc
queries, one should go through the user requirement document and understand
the business completely. Take time to test the most awkward queries that
the business is likely to ask against different index and aggregation
strategies.

Testing the Application

 All the managers should be integrated correctly and work in order to ensure
that the end-to-end load, index, aggregate and queries work as per the
expectations.

 Each function of each manager should work correctly

 It is also necessary to test the application over a period of time.

 Week-end and month-end tasks should also be tested.

Logistics of the Test

The aim of the system test is to test all of the following areas −

 Scheduling software

 Day-to-day operational procedures

 Backup recovery strategy

 Management and scheduling tools

 Overnight processing

 Query performance

Note − The most important point is to test the scalability. Failure to do so will
leave us with a system design that does not work when the system grows.

 Data Mining - Applications & Trends

Data mining is widely used in diverse areas. There are a number of commercial data mining systems
available today, and yet there are many challenges in this field. In this section, we will discuss the
applications and the trends of data mining.

 Data Mining Applications



Here is the list of areas where data mining is widely used −

 Financial Data Analysis


 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection

 Financial Data Analysis

The financial data in banking and financial industry is generally reliable and of high quality which
facilitates systematic data analysis and data mining. Some of the typical cases are as follows −

 Design and construction of data warehouses for multidimensional data analysis and data
mining.
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
 Detection of money laundering and other financial crimes.

 Retail Industry

Data mining has great application in the retail industry because the industry collects large amounts of data
on sales, customer purchasing history, goods transportation, consumption, and services. It is
natural that the quantity of data collected will continue to expand rapidly because of the increasing
ease, availability, and popularity of the web.

Data mining in retail industry helps in identifying customer buying patterns and trends that lead to
improved quality of customer service and good customer retention and satisfaction. Here is the list
of examples of data mining in the retail industry −

 Design and Construction of data warehouses based on the benefits of data mining.
 Multidimensional analysis of sales, customers, products, time and region.
 Analysis of effectiveness of sales campaigns.
 Customer Retention.
 Product recommendation and cross-referencing of items.

 Telecommunication Industry

Today the telecommunication industry is one of the fastest-emerging industries, providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is the reason why data mining has become
very important in helping to understand the business.

Data mining in the telecommunication industry helps in identifying telecommunication patterns,
catching fraudulent activities, making better use of resources, and improving the quality of service. Here is the
list of examples for which data mining improves telecommunication services −

 Multidimensional Analysis of Telecommunication data.



 Fraudulent pattern analysis.
 Identification of unusual patterns.
 Multidimensional association and sequential patterns analysis.
 Mobile Telecommunication services.
 Use of visualization tools in telecommunication data analysis.

 Biological Data Analysis

In recent times, we have seen a tremendous growth in the field of biology such as genomics,
proteomics, functional genomics, and biomedical research. Biological data mining is a very
important part of bioinformatics. Following are the aspects in which data mining contributes to
biological data analysis −

 Semantic integration of heterogeneous, distributed genomic and proteomic databases.


 Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide
sequences.
 Discovery of structural patterns and analysis of genetic networks and protein pathways.
 Association and path analysis.
 Visualization tools in genetic data analysis.

 Other Scientific Applications

The applications discussed above tend to handle relatively small and homogeneous data sets for
which the statistical techniques are appropriate. Huge amounts of data have been collected from
scientific domains such as geosciences, astronomy, etc. Large data sets are also being
generated by fast numerical simulations in various fields such as climate and ecosystem
modelling, chemical engineering, fluid dynamics, etc. Following are the applications of data mining
in the field of Scientific Applications −

 Data Warehouses and data preprocessing.


 Graph-based mining.
 Visualization and domain specific knowledge.

 Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of
network resources. In this world of connectivity, security has become a major issue. The
increased usage of the internet and the availability of tools and tricks for intruding into and attacking
networks have prompted intrusion detection to become a critical component of network administration.
Here is the list of areas in which data mining technology may be applied for intrusion detection −

 Development of data mining algorithm for intrusion detection.


 Association and correlation analysis, aggregation to help select and build discriminating
attributes.
 Analysis of Stream data.
 Distributed data mining.
 Visualization and query tools.

 Data Mining System Products



There are many data mining system products and domain specific data mining applications. The
new data mining systems and applications are being added to the previous systems. Also, efforts are
being made to standardize data mining languages.

 Choosing a Data Mining System

The selection of a data mining system depends on the following features −

 Data Types − The data mining system may handle formatted text, record-based data, and
relational data. The data could also be in ASCII text, relational database data or data
warehouse data. Therefore, we should check what exact format the data mining system can
handle.
 System Issues − We must consider the compatibility of a data mining system with different
operating systems. One data mining system may run on only one operating system or on
several. There are also data mining systems that provide web-based user interfaces and
allow XML data as input.
 Data Sources − Data sources refer to the data formats on which the data mining system will
operate. Some data mining systems may work only on ASCII text files while others work on
multiple relational sources. The data mining system should also support ODBC or
OLE DB connections.
 Data Mining functions and methodologies − There are some data mining systems that
provide only one data mining function such as classification, while some provide multiple
data mining functions such as concept description, discovery-driven OLAP analysis,
association mining, linkage analysis, statistical analysis, classification, prediction,
clustering, outlier analysis, similarity search, etc.
 Coupling data mining with databases or data warehouse systems − Data mining systems
need to be coupled with a database or a data warehouse system. The coupled components
are integrated into a uniform information processing environment. Here are the types of
coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
 Scalability − There are two scalability issues in data mining (a small timing sketch follows this list) −
o Row (Database size) Scalability − A data mining system is considered row
scalable if, when the number of rows is enlarged 10 times, it takes no more than 10
times as long to execute a query.
o Column (Dimension) Scalability − A data mining system is considered column
scalable if the mining query execution time increases linearly with the number of
columns.
 Visualization Tools − Visualization in data mining can be categorized as follows −
o Data Visualization
o Mining Results Visualization
o Mining process visualization
o Visual data mining
 Data Mining query language and graphical user interface − An easy-to-use graphical
user interface is important to promote user-guided, interactive data mining. Unlike relational
database systems, data mining systems do not share an underlying data mining query language.
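The row-scalability check mentioned in the list above can be approximated with a rough timing sketch like the one below; SQLite is used purely as a self-contained stand-in for whatever database the mining system is coupled with, and the table, query, and row counts are invented.

import sqlite3
import time

def query_time(row_count):
    """Load row_count synthetic rows and time a simple aggregation query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k INTEGER, v REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     ((i % 100, float(i)) for i in range(row_count)))
    start = time.perf_counter()
    conn.execute("SELECT k, SUM(v) FROM t GROUP BY k").fetchall()
    return time.perf_counter() - start

small, large = query_time(10_000), query_time(100_000)
# Row scalable (by the definition above) roughly means this ratio stays at or below 10.
print(f"10x rows -> {large / small:.1f}x query time")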

 Trends in Data Mining



Data mining concepts are still evolving, and here are the latest trends that we get to see in this field −
 Application Exploration.
 Scalable and interactive data mining methods.
 Integration of data mining with database systems, data warehouse systems and web database
systems.
 Standardization of data mining query language.
 Visual data mining.
 New methods for mining complex types of data.
 Biological data mining.
 Data mining and software engineering.
 Web mining.
 Distributed data mining.
 Real time data mining.
 Multi database data mining.
 Privacy protection and information security in data mining.

 Types of Warehousing Applications, Web Mining

 Types of Data Warehouses

There are different types of data warehouses, which are as follows:

 Host-Based Data Warehouses



There are two types of host-based data warehouses which can be implemented:

 Host-based mainframe warehouses, which reside on a high-volume database, supported by robust
and reliable high-capacity structures such as IBM System/390, UNISYS and Data General Sequent
systems, and databases such as Sybase, Oracle, Informix, and DB2.
 Host-based LAN data warehouses, where data delivery can be handled either centrally or from the
workgroup environment. The size of the data warehouse database depends on the platform.

Data extraction and transformation tools allow the automated extraction and cleaning of data from
production systems. It is not advisable to enable direct access by query tools to these categories of
systems, for the following reasons:

1. A huge load of complex warehousing queries would possibly have too harmful an impact
upon the mission-critical transaction processing (TP)-oriented applications.
2. These TP systems have had their database design tuned for transaction throughput; in general, a database is
designed for either optimal query or optimal transaction processing. A complex business
query requires the joining of many normalized tables, and as a result performance will usually be poor
and the query constructs largely complex.
3. There is no assurance that the data in two or more production systems will be consistent.

 Host-Based (MVS) Data Warehouses

Data warehouses that reside on large-volume databases on MVS are the host-based type
of data warehouse. Often the DBMS is DB2, with a huge variety of original sources for legacy
information, including VSAM, DB2, flat files, and the Information Management System (IMS).

Before embarking on designing, building, and implementing such a warehouse, some further
consideration must be given, because:

1. Such databases generally have very high volumes of data storage.


2. Such warehouses may require support for both MVS and customer-based report and query facilities.
3. These warehouses have complicated source systems.
4. Such systems need continuous maintenance, since they must also be used for mission-critical
objectives.



To make the building of such data warehouses successful, the following phases are generally followed (a toy sketch of these phases appears after the list):

1. Unload Phase: This involves selecting and scrubbing the operational data.

2. Transform Phase: This translates the data into an appropriate form and describes the rules for accessing
and storing it.
3. Load Phase: This moves the records directly into DB2 tables, or into a particular file for moving them into
another database or non-MVS warehouse.
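A toy end-to-end sketch of the three phases is shown below; it uses an in-memory SQLite table as a stand-in for DB2, and the source rows, column names, and cleansing rules are all invented.

import sqlite3

# Raw operational data as it might arrive from a legacy source.
source_rows = [("  alice ", "100"), ("BOB", "x"), ("carol", "250")]

# Unload phase: select and scrub the operational data (drop unparsable rows).
scrubbed = [(name, amount) for name, amount in source_rows if amount.strip().isdigit()]

# Transform phase: translate into an appropriate form (trimmed, lower-case, numeric).
transformed = [(name.strip().lower(), int(amount)) for name, amount in scrubbed]

# Load phase: move the records into the target table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE customer_balance (name TEXT, amount INTEGER)")
dw.executemany("INSERT INTO customer_balance VALUES (?, ?)", transformed)
print(dw.execute("SELECT * FROM customer_balance").fetchall())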

An integrated metadata repository is central to any data warehouse environment. Such a facility is
required for documenting data sources, data translation rules, and the user areas of the warehouse. It
provides a dynamic link between the multiple data source databases and the DB2-based data warehouse.

A metadata repository is necessary to design, build, and maintain data warehouse processes. It
should be capable of describing what data exists in both the operational system and the data
warehouse, where the data is located, the mapping of the operational data to the warehouse fields,
and the end-user access techniques. Query, reporting, and maintenance facilities, such as an
MVS-based query and reporting tool for DB2, are another indispensable part of such a data warehouse.

 Host-Based (UNIX) Data Warehouses

Oracle and Informix RDBMSs support the facilities for such data warehouses. Both of these
databases can extract information from MVS-based databases as well as from a large number of other
UNIX-based databases. These types of warehouses follow the same stages as the host-based MVS
data warehouses. Data from different network servers can also be used, since file attribute
consistency is frequent across the inter-network.

 LAN-Based Workgroup Data Warehouses

A LAN-based workgroup warehouse is an integrated structure for building and maintaining a data
warehouse in a LAN environment. In this warehouse, we can extract information from a variety of
sources and support multiple LAN-based warehouses; generally chosen warehouse databases
include the DB2 family, Oracle, Sybase, and Informix. Other databases that can also be included,
though infrequently, are IMS, VSAM, flat files, MVS, and VM.



Designed for the workgroup environment, a LAN-based workgroup warehouse is optimal for any
business organization that wants to build a data warehouse, often called a data mart. This type of
data warehouse generally requires a minimal initial investment and little technical training.

Data Delivery: With a LAN-based workgroup warehouse, a customer needs minimal technical
knowledge to create and maintain a store of data that is customized for use at the department, business
unit, or workgroup level. A LAN-based workgroup warehouse ensures the delivery of information
from corporate resources by providing transparent access to the data in the warehouse.

 Host-Based Single Stage (LAN) Data Warehouses

Within a LAN-based data warehouse, data delivery can be handled either centrally or from the
workgroup environment, so business groups can process their data as needed without burdening
centralized IT resources, enjoying the autonomy of their own data mart without compromising overall data
integrity and security in the enterprise.

Limitations

Both DBMS and hardware scalability issues generally limit LAN-based warehousing solutions.

Many LAN based enterprises have not implemented adequate job scheduling, recovery
management, organized maintenance, and performance monitoring methods to provide robust
warehousing solutions.

Often these warehouses are dependent on other platforms for source records. Building an
environment that has data integrity, recoverability, and security requires careful design, planning,
and implementation. Otherwise, synchronization of transformations and loads from sources to the
server could cause innumerable problems.

A LAN-based warehouse provides data from many sources while requiring a minimal initial investment
and little technical knowledge. A LAN-based warehouse can also use replication tools for populating
and updating the data warehouse. This type of warehouse can include business views, histories,
aggregations, versioning, and heterogeneous source support, such as:

 DB2 Family
 IMS, VSAM, Flat File [MVS and VM]



A single store frequently drives a LAN-based warehouse and supports existing DSS applications,
enabling the business user to locate data in the data warehouse. The LAN-based warehouse can
support business users with a complete data-to-information solution. It can also share metadata,
with the ability to catalog business data and make it accessible to anyone who
needs it.

 Multi-Stage Data Warehouses

It refers to multiple stages of transformation for analyzing data through aggregations. In
other words, the data is staged multiple times before the loading operation into the data
warehouse: data is extracted from the source systems to a staging area first, then loaded into the data
warehouse after transformation, and then finally into departmentalized data marts.

This configuration is well suited to environments where end-clients in numerous capacities
require access both to summarized information for up-to-the-minute tactical decisions and to
summarized, cumulative records for long-term strategic decisions. Both the Operational Data
Store (ODS) and the data warehouse may reside on host-based or LAN-based databases, depending
on volume and custom requirements. These include DB2, Oracle, Informix, IMS, flat files, and
Sybase.

Usually, the ODS stores only the most up-to-date records, while the data warehouse stores the historical
accumulation of the records. At first, the information in both databases will be very similar; for example,
the records for a new client will look the same. As changes to the client record occur, the ODS will
be refreshed to reflect only the most current data, whereas the data warehouse will contain both the
historical data and the new information. Thus the volume requirement of the data warehouse will
exceed the volume requirement of the ODS over time. It is not unusual to reach a ratio of 4 to 1 in
practice.

 Stationary Data Warehouses

In this type of data warehouse, the data is not moved from the sources.



Instead, the customer is given direct access to the data at the sources. For many organizations, infrequent access,
volume issues, or corporate necessities dictate such an approach. This scheme does generate several
problems for the customer, such as:

 Identifying the location of the information for the users


 Providing clients the ability to query different DBMSs as if they were all a single DBMS with a
single API.
 Impacting performance since the customer will be competing with the production data stores.

Such a warehouse will need highly specialized and sophisticated 'middleware', possibly with a
single point of interaction for the client. A facility to display the extracted
records to the user before report generation may also be essential. An integrated metadata repository becomes
absolutely essential in this environment.

 Distributed Data Warehouses

The concept of a distributed data warehouse suggests that there are two types of distributed data
warehouses: local enterprise warehouses, which are distributed
throughout the enterprise, and a global warehouse.

Characteristics of Local data warehouses

 Activity appears at the local level


 Bulk of the operational processing
 Local site is autonomous
 Each local data warehouse has its unique architecture and contents of data
 The data is unique and of prime importance to that locality only
 Majority of the record is local and not replicated
 Any intersection of data between local data warehouses is circumstantial
 Local warehouse serves different technical communities
 The scope of the local data warehouses is finite to the local site



 Local warehouses also include historical data and are integrated only within the local site.

 Virtual Data Warehouses

A virtual data warehouse is created in the following stages:

1. Installing a set of data access, data dictionary, and process management facilities.
2. Training the end-clients.
3. Monitoring how the data warehouse facilities are used.
4. Based upon actual usage, creating a physical data warehouse to serve the high-frequency results.

This strategy means that end users are allowed to access operational databases directly, using
whatever tools are connected to the data access network. This approach provides the ultimate
flexibility as well as the minimum amount of redundant data that must be loaded and
maintained. The data warehouse is a great idea, but it is complex to build and requires investment.
Why not use a cheap and fast approach by eliminating the transformation steps and the repositories for
metadata and other data? This approach is termed the 'virtual data warehouse.'

To accomplish this, there is a need to define four kinds of data:

1. A data dictionary including the definitions of the various databases.


2. A description of the relationship between the data components.
3. A description of how the user will interface with the system.
4. The algorithms and business rules that describe what to do and how to do it.

Disadvantages

1. Since queries compete with production transactions, performance can be degraded.
2. There is no metadata, no summary data, and no individual DSS (Decision Support System)
integration or history. All queries must be repeated, causing an additional burden on the system.
3. There is no refreshing process, causing the queries to be very complex.



 Web Mining

Web mining is the process of applying data mining techniques to automatically discover and extract
information from web documents and services. The main purpose of web mining is to discover useful
information from the World Wide Web and its usage patterns.


Web mining is the practice of sifting through the vast amount of data available on the World Wide Web
to find and extract pertinent information. One distinctive feature of web mining is the diversity of data
it deals with: web pages are made up of text, they are connected by hyperlinks, and web server logs
allow the monitoring of user behaviour. These different elements of the web lead to different methods
for the actual mining process. Combining methods from data mining, machine learning, artificial
intelligence, statistics, and information retrieval, web mining is an interdisciplinary field.
Analyzing user behaviour and website traffic is one basic example of web mining.

 Applications of Web Mining

Web mining is the process of discovering patterns, structures, and relationships in web data. It
involves using data mining techniques to analyze web data and extract valuable insights. The
applications of web mining are wide-ranging and include:

 Personalized marketing: Web mining can be used to analyze customer behavior on websites and
social media platforms. This information can be used to create personalized marketing campaigns
that target customers based on their interests and preferences.

 E-commerce: Web mining can be used to analyze customer behavior on e-commerce websites. This
information can be used to improve the user experience and increase sales by recommending
products based on customer preferences.

 Search engine optimization: Web mining can be used to analyze search engine queries and search
engine results pages (SERPs). This information can be used to improve the visibility of websites in
search engine results and increase traffic to the website.

 Fraud detection: Web mining can be used to detect fraudulent activity on websites. This
information can be used to prevent financial fraud, identity theft, and other types of online fraud.

 Sentiment analysis: Web mining can be used to analyze social media data and extract sentiment
from posts, comments, and reviews. This information can be used to understand customer sentiment
towards products and services and make informed business decisions.

 Web content analysis: Web mining can be used to analyze web content and extract valuable
information such as keywords, topics, and themes. This information can be used to improve the
relevance of web content and optimize search engine rankings.

 Customer service: Web mining can be used to analyze customer service interactions on websites
and social media platforms. This information can be used to improve the quality of customer service
and identify areas for improvement.



 Healthcare: Web mining can be used to analyze health-related websites and extract valuable
information about diseases, treatments, and medications. This information can be used to improve
the quality of healthcare and inform medical research.

 Process of Web Mining


Web mining can be broadly divided into three different types of techniques: web content mining,
web structure mining, and web usage mining. These are explained below.

Categories of Web Mining

 Web Content Mining: Web content mining is the practice of extracting useful information from
the content of web documents. Web content consists of several types of data: text, images, audio,
video, etc. Content data is the collection of facts that a web page is designed to convey. It can reveal
effective and interesting patterns about user needs. Mining of text documents draws on text mining,
machine learning, and natural language processing, which is why this kind of mining is also known as
text mining. It scans and mines the text, images, and groups of web pages according to the content
of the input.

 Web Structure Mining: Web structure mining is the practice of discovering structural
information from the web. The structure of the web graph consists of web pages as nodes and
hyperlinks as edges connecting related pages. Structure mining basically produces a structured
summary of a particular website. It identifies relationships between web pages linked by information
or by direct link connections. Web structure mining can be very useful for determining the connection
between two commercial websites.

 Web Usage Mining: Web usage mining is the practice of identifying or discovering interesting
usage patterns from large sets of web access data; these patterns help you understand user
behaviour. In web usage mining, users' access data on the web is collected in the form of server
logs, so web usage mining is also called log mining. A minimal log-parsing sketch follows this list.
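The sketch below gives a minimal flavour of web usage mining: it counts page hits and requests per visitor from a tiny, invented access log; the log lines and the regular expression are illustrative only.

import re
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:01:10] "GET /products HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2024:10:02:30] "GET /home HTTP/1.1" 200 512',
]

# Capture the client address and the requested page from each log line.
pattern = re.compile(r'^(\S+) .* "GET (\S+) HTTP')
page_hits, visitors = Counter(), Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:
        ip, page = match.groups()
        page_hits[page] += 1
        visitors[ip] += 1

print("most requested pages:", page_hits.most_common(2))
print("requests per visitor:", dict(visitors))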

 Challenges of Web Mining

 Complexity of web pages: There is no cohesive framework across a site's pages, so compared with
conventional text documents they are far more intricate. The web's digital library contains a vast
number of documents, and these documents are not arranged in any set order for the user.

 Dynamic data sources on the internet: Online data is updated in real time; for instance, news,
weather, fashion, finance, and sports content changes continuously and is difficult to capture properly.

 Data relevancy: A particular person is typically only concerned with a tiny portion of the web,
with the remaining portion containing data that is irrelevant to the user's requirement and
unfamiliar to the user.

 Sheer size of the web: The web is growing very quickly, and it already seems too big for
exhaustive data mining and data warehousing.

 Comparison between Data Mining and Web Mining

Definition − Data mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system, whereas web mining is the process of applying data mining techniques to automatically discover and extract information from web documents.

Application − Data mining is very useful for web page analysis, whereas web mining is very useful for a particular website and e-services.

Target Users − Data mining is used by data scientists and data engineers, whereas web mining is used by data scientists along with data analysts.

Structure − Data mining gets its information from an explicit structure, whereas web mining gets its information from structured, unstructured, and semi-structured web pages.

Problem Type − Data mining covers clustering, classification, regression, prediction, optimization, and control, whereas web mining covers web content mining, web structure mining, and web usage mining.

Tools − Data mining includes tools such as machine learning algorithms, whereas special tools for web mining include Scrapy, PageRank, and Apache log analysers.

Skills − Data mining requires approaches for data cleansing, machine learning algorithms, statistics, and probability, whereas web mining requires application-level knowledge and data engineering with modules like statistics and probability.

 Difference between Spatial and Temporal Data Mining

Spatial data mining refers to the process of extracting knowledge, spatial relationships, and
interesting patterns that are not explicitly stored in a spatial database; on the other hand, temporal
data mining refers to the process of extracting knowledge about the occurrence of events, whether
they follow a random, cyclic, or seasonal variation, etc. Spatial means space, whereas temporal
means time. In this section, we will look at spatial and temporal data mining separately; after that, we
will discuss the differences between them.

 Spatial Data Mining

The emergence of spatial data and the extensive usage of spatial databases have led to spatial knowledge
discovery. Spatial data mining can be understood as a process that determines some exciting and
potentially valuable patterns from spatial databases.

Several tools assist in extracting information from geospatial data. These tools play a
vital role for organizations like NASA, the National Imagery and Mapping Agency (NIMA), the
National Cancer Institute (NCI), and the United States Department of Transportation (USDOT),
which tend to make big decisions based on large spatial datasets.

Earlier, some general-purpose data mining tools such as Clementine, See5/C5.0, and Enterprise Miner were
used. These tools were utilized to analyse large commercial databases and were mainly
designed for understanding the buying patterns of customers from the database.

Besides this, the general-purpose tools were also used to analyze scientific and engineering data,
astronomical data, multimedia data, genomic data, and web data.

The specific features of geographical data that prevent the use of general-purpose
data mining algorithms are:

1. spatial relationships among the variables,


2. spatial structure of errors
3. observations that are not independent
4. spatial autocorrelation among the features
5. non-linear interaction in feature space.

Spatial data must have latitude or longitude, UTM easting or northing, or some other coordinates
denoting a point's location in space. Beyond that, spatial data can contain any number of attributes
pertaining to a place. You can choose the types of attributes you want to describe a place.
Government websites provide a resource by offering spatial data, but you need not be limited to
what they have produced. You can produce your own.



Say, for example, you wanted to log information about every location you've visited in the past
week. This might be useful to provide insight into your daily habits. You could capture your
destination's coordinates and list a number of attributes such as place name, the purpose of visit,
duration of visit, and more. You can then create a shapefile in Quantum GIS or similar software
with this information and use the software to query and visualize the data. For example, you could
generate a heatmap of the most visited places or select all places you've visited within a radius of 8
miles from home.
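A toy illustration of the 8-mile radius query mentioned above is given below; the coordinates are made up, and a GIS package such as Quantum GIS would normally handle this kind of query for you.

from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))      # 3958.8 = Earth's radius in miles

home = (28.61, 77.21)
visits = {"office": (28.63, 77.22), "market": (28.56, 77.12), "hill town": (29.38, 79.45)}

# Select all visited places within a radius of 8 miles from home.
nearby = {name: coords for name, coords in visits.items()
          if haversine_miles(*home, *coords) <= 8}
print(nearby)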

Any data can be made spatial if it can be linked to a location, and one can even have spatiotemporal
data linked to locations in both space and time. For example, when geolocating tweets from Twitter
in the aftermath of a disaster, an animation might be generated that shows the spread of tweets from
the epicentre of the event.

 Spatial data mining tasks

These are the primary tasks of spatial data mining.

Classification:

Classification determines a set of rules which find the class of the specified object as per its
attributes.

Association rules:

Association rules determine rules from the data sets, and they describe patterns that frequently occur in the
database.

Characteristic rules:

Characteristic rules describe some parts of the data set.

Discriminant rules:

As the name suggests, discriminant rules describe the differences between two parts of the database,
such as calculating the difference between two cities in terms of employment rate.

 Temporal Data Mining


Temporal data mining refers to the process of extracting non-trivial, implicit, and potentially
important information from huge sets of temporal data. Temporal data are sequences of a primary data
type, usually numerical values, and temporal data mining deals with gathering useful knowledge from such data.

With the increase in stored data, interest in finding hidden information has soared in the last decade.
The search for hidden information has primarily been focused on classifying data, finding relationships,
and clustering data. The major drawback that arises during the discovery process is treating data
with temporal dependencies: the attributes related to the temporal data present in this type of
dataset must be treated differently from other types of attributes. Nevertheless, most data mining
techniques treat temporal data as an unordered collection of events, ignoring their temporal information.

 Temporal data mining tasks

 Data characterization and comparison


 Cluster Analysis
 Classification
 Association rules
 Prediction and trend analysis
 Pattern Analysis

 Difference between Spatial and Temporal Data Mining

Definition − Spatial data mining refers to the extraction of knowledge, spatial relationships, and interesting patterns that are not explicitly stored in a spatial database, whereas temporal data mining refers to the process of extracting knowledge about the occurrence of events, whether they follow a random, cyclic, or seasonal variation, etc.

Dimension − Spatial data mining deals with space, whereas temporal data mining deals with time.

Data − Spatial data mining primarily deals with spatial data such as location and geo-referenced data, whereas temporal data mining primarily deals with implicit and explicit temporal content drawn from huge sets of data.

Rules − Spatial data mining involves characteristic rules, discriminant rules, evaluation rules, and association rules, whereas temporal data mining targets mining new patterns and unknown knowledge that takes the temporal aspects of the data into account.

Examples − Spatial data mining: finding hotspots and unusual locations. Temporal data mining: an association rule such as "Any person who buys a motorcycle also buys a helmet" becomes, with the temporal aspect, "Any person who buys a motorcycle also buys a helmet after that."
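As a toy illustration of the temporal rule quoted in the last row of the comparison, the snippet below checks whether a helmet purchase follows a motorcycle purchase in a single customer's (invented) purchase history.

# One customer's purchase history: (date, item) pairs, in no particular order.
purchases = [("2024-01-05", "motorcycle"), ("2024-01-09", "helmet"), ("2024-02-01", "gloves")]

def follows(sequence, first, then):
    """Return True if 'then' is bought at some point after 'first'."""
    seen_first = False
    for _, item in sorted(sequence):          # ISO dates sort chronologically
        if item == first:
            seen_first = True
        elif item == then and seen_first:
            return True
    return False

print(follows(purchases, "motorcycle", "helmet"))   # True for this history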

