CS201 17data Mining

This document discusses objectives and concepts related to online analytical processing (OLAP), data warehousing, and data mining. It covers key features of OLAP applications including multi-dimensional views of data, complex calculations, and time intelligence. The document also discusses OLAP extensions to SQL, multi-dimensional OLAP servers that use cube structures to store and retrieve data, and different types of data mining operations.

Uploaded by

Paida Heart

Introduction to Data Mining

09/15/15

HCT303 Application of Database systems

Objectives
Purpose of online analytical processing (OLAP) and how OLAP differs from data warehousing.
Key features of OLAP applications.
Potential benefits associated with successful OLAP applications.
Rules for OLAP tools and the main types of tools, including multi-dimensional OLAP (MOLAP), relational OLAP (ROLAP), and managed query environment (MQE).

Objectives
OLAP extensions to SQL.
Concepts associated with data mining.
Main data mining operations, including predictive modeling, database segmentation, link analysis, and deviation detection.
Relationship between data mining and data warehousing.

Acknowledgments
These slides have been adapted from Thomas Connolly and Carolyn Begg.

Data Warehousing and End-User Access Tools
Accompanying the growth in data warehouses is an increasing demand for more powerful access tools that provide advanced analytical capabilities.
Key developments include:
Online analytical processing (OLAP).
SQL extensions for complex data analysis.
Data mining tools.

Introducing OLAP
OLAP is the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data (Codd, 1993).
Describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for purposes of advanced analysis.

Introducing OLAP
Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.
Allows users to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise.

Introducing OLAP
Can easily answer "who?" and "what?" questions; however, it is the ability to answer "what if?" and "why?" questions that distinguishes OLAP from general-purpose query tools.
Types of analysis range from basic navigation and browsing (slicing and dicing), to calculations, to more complex analyses such as time series and complex modeling.

OLAP Applications
Just-In-Time (JIT) information is computed data that usually reflects complex relationships and is often calculated on the fly.
Also, as data relationships may not be known in advance, the data model must be flexible.

Examples of OLAP Applications in Various Functional Areas

OLAP Applications
Although OLAP applications are found in widely divergent functional areas, all have the following key features:
multi-dimensional views of data;
support for complex calculations;
time intelligence.

Representing Multi-Dimensional Data
Example of a two-dimensional query.
What is the total revenue generated by property sales in each city, in each quarter of 1997?
Choice of representation is based on the types of queries the end-user may ask.
Compare representation - three-field relational table versus two-dimensional matrix.

Multi-Dimensional Data as Three-Field Table versus Two-Dimensional Matrix
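
The comparison between the three-field table and the two-dimensional matrix can be sketched in a few lines of Python. This is only an illustration: the city names, quarters, and revenue figures are hypothetical and do not come from the slides, and `to_matrix` is a helper name chosen for this sketch.

```python
from collections import defaultdict

# Hypothetical three-field rows (city, quarter, revenue); the figures are
# made up for illustration and are not taken from the slides.
rows = [
    ("Glasgow", "Q1", 100),
    ("Glasgow", "Q2", 120),
    ("London", "Q1", 200),
    ("London", "Q2", 210),
]

def to_matrix(records):
    """Pivot (row_key, col_key, value) records into a nested-dict matrix."""
    matrix = defaultdict(dict)
    for city, quarter, revenue in records:
        # Sum in case the same (city, quarter) pair appears more than once.
        matrix[city][quarter] = matrix[city].get(quarter, 0) + revenue
    return {city: dict(cols) for city, cols in matrix.items()}

matrix = to_matrix(rows)
print(matrix["London"]["Q2"])  # one cell, addressed by its two dimensions
```

In the table form every answer requires scanning rows; in the matrix form a cell is addressed directly by its city and quarter.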

Representing Multi-Dimensional Data
Example of a three-dimensional query.
What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 1997?
Compare representation - four-field relational table versus three-dimensional cube.

Multi-Dimensional Data as Four-Field Table versus Three-Dimensional Cube

Representing Multi-Dimensional Data
Cube represents data as cells in an array.
Relational table only represents multi-dimensional data in two dimensions.

Multi-Dimensional OLAP Servers
Use multi-dimensional structures to store data and relationships between data.
Multi-dimensional structures are best visualized as cubes of data, and cubes within cubes of data. Each side of a cube is a dimension.
A cube can be expanded to include other dimensions.

Multi-Dimensional OLAP Servers
A cube supports matrix arithmetic.
Multi-dimensional query response time depends on how many cells have to be added on the fly.
As the number of dimensions increases, the number of the cube's cells increases exponentially.
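
A small sketch of why the cell count grows so quickly: the number of cells in a cube is the product of the sizes of its dimensions. The choice of 10 members per dimension is an assumption made purely for illustration.

```python
from math import prod

def cell_count(cardinalities):
    """Number of cells in a cube = product of the dimension sizes."""
    return prod(cardinalities)

# Hypothetical cube in which every dimension has 10 members.
for n_dims in range(1, 5):
    print(n_dims, cell_count([10] * n_dims))
```

Each added dimension multiplies the number of cells that a query may have to touch, which is the motivation for pre-aggregation.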

Multi-Dimensional OLAP Servers
However, the majority of multi-dimensional queries use summarized, high-level data.
Solution is to pre-aggregate (consolidate) all logical subtotals and totals along all dimensions.
Pre-aggregation is valuable, as typical dimensions are hierarchical in nature (e.g. Time dimension hierarchy - years, quarters, months, weeks, and days).

Multi-Dimensional OLAP Servers
Predefined hierarchy allows logical pre-aggregation and, conversely, allows for a logical drill-down.
Supports common analytical operations:
Consolidation.
Drill-down.
Slicing and dicing.

Multi-Dimensional OLAP Servers
Consolidation - aggregation of data such as simple roll-ups or complex expressions involving inter-related data.
Drill-Down - is the reverse of consolidation and involves displaying the detailed data that comprises the consolidated data.
Slicing and Dicing - (also called pivoting) refers to the ability to look at the data from different viewpoints.
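
The three operations can be sketched on a tiny in-memory cube. This is a minimal illustration rather than how any OLAP server is implemented: the cube is a plain dictionary, the cell values are hypothetical, and `consolidate` and `slice_cube` are names invented for this sketch.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (property_type, city, quarter).
cube = {
    ("Flat", "Glasgow", "Q1"): 40,
    ("Flat", "Glasgow", "Q2"): 50,
    ("House", "Glasgow", "Q1"): 60,
    ("House", "London", "Q1"): 90,
    ("Flat", "London", "Q2"): 70,
}
DIMS = ("property_type", "city", "quarter")

def consolidate(cells, keep):
    """Roll up: sum revenue over every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)

def slice_cube(cells, dim, member):
    """Slice: keep only the cells where dimension dim equals member."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == member}

# Consolidation: total revenue per city.
by_city = consolidate(cube, ["city"])
# Drill-down: the detailed cells that make up the Glasgow total.
glasgow_detail = slice_cube(cube, "city", "Glasgow")
# Slicing and dicing: view Q1 only, then regroup by property type.
by_type_q1 = consolidate(slice_cube(cube, "quarter", "Q1"), ["property_type"])
```

Drill-down here simply returns the cells behind a consolidated figure, so the detail always sums back to the total it came from.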

Multi-Dimensional OLAP Servers
Can store data in a compressed form by dynamically selecting physical storage organizations and compression techniques that maximize space utilization.
Dense data (i.e., data that exists for a high percentage of cells) can be stored separately from sparse data (i.e., a significant percentage of cells are empty).
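
One simple way to picture sparse storage is a dictionary that holds only the non-empty cells. The sketch below assumes a hypothetical 100 x 100 x 10 cube with three populated cells; real MOLAP servers use their own storage organizations, which the slides do not detail.

```python
# Hypothetical 100 x 100 x 10 cube in which only three cells hold data.
SHAPE = (100, 100, 10)

sparse_cube = {
    (0, 0, 0): 12.5,
    (4, 17, 3): 8.0,
    (99, 99, 9): 3.25,
}

def get_cell(cells, coords, default=0.0):
    """Return the stored value, or a default for an omitted (empty) cell."""
    return cells.get(coords, default)

dense_cells = SHAPE[0] * SHAPE[1] * SHAPE[2]  # cells a dense array would hold
stored_cells = len(sparse_cube)               # cells actually stored
density = stored_cells / dense_cells
print(dense_cells, stored_cells, density)
```

Omitting the empty cells keeps storage proportional to the populated cells, not to the full size of the cube.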

Multi-Dimensional OLAP Servers
Ability to omit empty or repetitive cells can greatly reduce the size of the cube and the amount of processing.
Allows analysis of exceptionally large amounts of data.

Multi-Dimensional OLAP Servers
In summary, pre-aggregation, dimensional hierarchy, and sparse data management can significantly reduce the size of the cube and the need to calculate values on the fly.
Removes need for multi-table joins and provides quick and direct access to arrays of data, thus significantly speeding up execution of multi-dimensional queries.

OLAP Extensions to SQL
SQL is promoted as easy to learn, non-procedural, free-format, DBMS-independent, and an international standard.
However, a major disadvantage has been its inability to represent many of the questions most commonly asked by business analysts.
IBM and Oracle jointly proposed OLAP extensions to SQL early in 1999, adopted as an amendment to SQL.

OLAP Extensions to SQL
Many database vendors including IBM, Oracle, Informix, and Red Brick Systems have already implemented portions of the specifications in their DBMSs.
Red Brick Systems was first to implement many essential OLAP functions (as Red Brick Intelligent SQL (RISQL)), albeit in advance of the standard.

OLAP Extensions to SQL - RISQL
Designed for business analysts.
Set of extensions that augments SQL with a variety of powerful operations appropriate to data analysis and decision-support applications such as ranking, moving averages, comparisons, market share, this year versus last year.

Use of the RISQL CUME Function
Show the quarterly sales for branch office B003, along with the monthly year-to-date figures.
SELECT quarter, quarterlySales, CUME(quarterlySales) AS YeartoDate
FROM BranchSales
WHERE branchNo = 'B003';
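
The year-to-date column that the query asks for is a running total. A minimal Python sketch of the same calculation is shown here, using hypothetical quarterly figures; it illustrates the idea of a cumulative sum and does not reproduce RISQL itself.

```python
from itertools import accumulate

# Hypothetical quarterly sales for one branch; values are illustrative only.
quarters = ["Q1", "Q2", "Q3", "Q4"]
quarterly_sales = [800, 900, 700, 1000]

# Each entry is the sum of the current quarter and all earlier quarters.
year_to_date = list(accumulate(quarterly_sales))

for q, sales, ytd in zip(quarters, quarterly_sales, year_to_date):
    print(q, sales, ytd)
```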

Use of the RISQL MOVINGAVG / MOVINGSUM Function
Show the first six monthly sales for branch office B003 without the effect of seasonality.
SELECT month, monthlySales,
MOVINGAVG(monthlySales) AS 3MonthMovingAvg,
MOVINGSUM(monthlySales) AS 3MonthMovingSum
FROM BranchSales
WHERE branchNo = 'B003';
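
A three-month moving sum and moving average can be sketched as follows. Two assumptions are made for this illustration: the monthly figures are hypothetical, and the first two months use a shorter window because fewer than three values are available (how RISQL treats those leading rows is not stated in the slides).

```python
def moving_sum(values, window):
    """Sum of the current value and up to window-1 preceding values."""
    return [sum(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

def moving_avg(values, window):
    """Mean over the same trailing window used by moving_sum."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

monthly_sales = [210, 350, 400, 420, 440, 430]  # hypothetical six months

three_month_sum = moving_sum(monthly_sales, 3)
three_month_avg = moving_avg(monthly_sales, 3)
print(three_month_sum)
print(three_month_avg)
```

Averaging over a trailing window smooths out month-to-month swings, which is how the moving average reduces the effect of seasonality.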

Data Mining
The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
Involves analysis of data and use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

Data Mining
Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive.
Patterns and relationships are identified by examining the underlying rules and features in the data.
Tends to work from the data up, and the most accurate results normally require large volumes of data to deliver reliable conclusions.

Data Mining
Starts by developing an optimal representation of the structure of sample data, during which time knowledge is acquired and extended to larger sets of data.
Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.
Relatively new technology; however, already used in a number of industries.

Examples of Applications of Data Mining
Retail / Marketing
Identifying buying patterns of customers.
Finding associations among customer demographic characteristics.
Predicting response to mailing campaigns.
Market basket analysis.

Examples of Applications of Data Mining
Banking
Detecting patterns of fraudulent credit card use.
Identifying loyal customers.
Predicting customers likely to change their credit card affiliation.
Determining credit card spending by customer groups.

Examples of Applications of Data Mining
Insurance
Claims analysis.
Predicting which customers will buy new policies.
Medicine
Characterizing patient behavior to predict surgery visits.
Identifying successful medical therapies for different illnesses.

Data Mining Operations
Four main operations include:
Predictive modeling.
Database segmentation.
Link analysis.
Deviation detection.
There are recognized associations between the applications and the corresponding operations.
e.g. Direct marketing strategies use database segmentation.

Data Mining Techniques
Techniques are specific implementations of the data mining operations.
Each operation has its own strengths and weaknesses.
Data mining tools sometimes offer a choice of operations to implement a technique.

Data Mining Techniques
Criteria for selection of a tool include:
Suitability for certain input data types.
Transparency of the mining output.
Tolerance of missing variable values.
Level of accuracy possible.
Ability to handle large volumes of data.

Data Mining Operations and Associated Techniques

Predictive Modeling
Similar to the human learning experience: uses observations to form a model of the important characteristics of some phenomenon.
Uses generalizations of the real world and the ability to fit new data into a general framework.
Can analyze a database to determine essential characteristics (model) about the data set.

Predictive Modeling
Model is developed using a supervised learning approach, which has two phases: training and testing.
Training builds a model using a large sample of historical data called a training set.
Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.
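
The two phases can be sketched with a deliberately tiny model: a single threshold on one attribute. The records, the attribute (monthly income), and the approval labels are all hypothetical, and a one-threshold classifier is chosen only to keep the sketch short; it is not a technique named in the slides.

```python
# Hypothetical records: (monthly_income, approved) with approved in {0, 1}.
training_set = [(10, 0), (15, 0), (20, 0), (30, 1), (35, 1), (40, 1)]
test_set = [(12, 0), (22, 0), (28, 1), (45, 1)]

def predict(threshold, x):
    """Classify as 1 when the attribute is at or above the threshold."""
    return 1 if x >= threshold else 0

def accuracy(threshold, records):
    """Fraction of records whose predicted label matches the true label."""
    hits = sum(predict(threshold, x) == label for x, label in records)
    return hits / len(records)

def train(records):
    """Training phase: pick the observed value that best splits the labels."""
    candidates = sorted(x for x, _ in records)
    return max(candidates, key=lambda t: accuracy(t, records))

threshold = train(training_set)                 # built from historical data only
train_accuracy = accuracy(threshold, training_set)
test_accuracy = accuracy(threshold, test_set)   # measured on unseen data
print(threshold, train_accuracy, test_accuracy)
```

The model fits the training set perfectly but misclassifies one unseen record, which is why accuracy is measured on data the model has not seen.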

Predictive Modeling
Applications of predictive modeling include customer retention management, credit approval, cross selling, and direct marketing.
Two techniques associated with predictive modeling: classification and value prediction, distinguished by the nature of the variable being predicted.

Predictive Modeling - Classification
Used to establish a specific predetermined class for each record in a database from a finite set of possible class values.
Two specializations of classification: tree induction and neural induction.

Example of Classification using Tree Induction

Example of Classification using Neural Induction

Predictive Modeling - Value Prediction
Used to estimate a continuous numeric value that is associated with a database record.
Uses the traditional statistical techniques of linear regression and nonlinear regression.
Relatively easy to use and understand.

Predictive Modeling - Value Prediction
Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.
Problem is that the technique only works well with linear data and is sensitive to the presence of outliers (i.e., data values that do not conform to the expected norm).
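
The sensitivity to outliers can be seen with an ordinary least-squares fit. This sketch assumes the usual least-squares definition of the best-fitting line, which the slides do not spell out, and uses hypothetical data points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]         # hypothetical, perfectly linear
with_outlier = [2, 4, 6, 8, 30]  # same data with one outlying value

clean_fit = fit_line(xs, clean)
outlier_fit = fit_line(xs, with_outlier)
print(clean_fit, outlier_fit)
```

Changing a single observation triples the fitted slope in this example, so one unusual record can dominate the model.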

Predictive Modeling - Value Prediction
Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot.
Statistical measurements are fine for building linear models that describe predictable data points; however, most data is not linear in nature.

Predictive Modeling - Value Prediction
Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.
Applications of value prediction include credit card fraud detection and target mailing list identification.

Database Segmentation
Aim is to partition a database into an unknown number of segments, or clusters, of similar records.
Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.

Database Segmentation
Less precise than other operations thus less sensitive to redundant and irrelevant features.
Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance or by assigning a weighting factor to each variable.
Applications of database segmentation include customer profiling, direct marketing, and cross selling.
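
The effect of weighting or ignoring attributes can be sketched with a simple distance-based grouping. The method used here (each record joins the first segment whose first member is close enough, otherwise it starts a new segment) is chosen for brevity and is not a technique named in the slides; the customer records and the `radius` value are hypothetical.

```python
from math import sqrt

def distance(a, b, weights):
    """Weighted Euclidean distance; a weight of 0 ignores that attribute."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def segment(records, weights, radius):
    """Group records without fixing the number of segments in advance.

    A record joins the first segment whose leader (first member) lies
    within radius; otherwise it becomes the leader of a new segment.
    """
    leaders, segments = [], []
    for rec in records:
        for i, leader in enumerate(leaders):
            if distance(rec, leader, weights) <= radius:
                segments[i].append(rec)
                break
        else:
            leaders.append(rec)
            segments.append([rec])
    return segments

# Hypothetical customers: (age, annual_spend_in_thousands, irrelevant_digit)
customers = [(25, 2, 9), (27, 3, 1), (60, 20, 8), (62, 21, 0), (26, 2, 5)]

all_attributes = segment(customers, weights=(1, 1, 1), radius=5)
ignoring_third = segment(customers, weights=(1, 1, 0), radius=5)
print(len(all_attributes), len(ignoring_third))
```

With the irrelevant third attribute included, similar customers are split across four segments; giving it a weight of zero recovers the two natural groups.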

Example of Database Segmentation using a Scatterplot

Database Segmentation
Associated with demographic or neural clustering techniques, distinguished by:
Allowable data inputs.
Methods used to calculate the distance between records.
Presentation of the resulting segments for analysis.

Link Analysis
Aims to establish links (associations) between records, or sets of records, in a database.
There are three specializations:
Associations discovery.
Sequential pattern discovery.
Similar time sequence discovery.
Applications include product affinity analysis, direct marketing, and stock price movement.

Link Analysis - Associations Discovery
Finds items that imply the presence of other items in the same event.
Affinities between items are represented by association rules.
e.g. When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties.
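
The two percentages attached to an association rule are commonly called confidence (how often the consequence holds when the condition holds) and support (how often the whole rule applies across all records). Those terms and the sketch below are an assumption about how the rule would be measured; the renter records are hypothetical and are not the data behind the 40% and 35% figures in the example.

```python
# Hypothetical renter records; each is the set of facts that hold for a customer.
renters = [
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25"},
    {"rents>2y", "age>25"},
    {"rents>2y", "age>25"},
    {"rents>2y"},
    {"age>25", "buys"},
    {"age>25"},
    set(),
    set(),
]

def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    with_antecedent = [r for r in records if antecedent <= r]
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / len(records)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

support, confidence = rule_stats(renters, {"rents>2y", "age>25"}, {"buys"})
print(support, confidence)
```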

Link Analysis - Sequential Pattern Discovery
Finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time.
e.g. Used to understand long-term customer buying behavior.

Link Analysis - Similar Time Sequence Discovery
Finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate.
e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.

Deviation Detection
Relatively new operation in terms of commercially available data mining tools.
Often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm.

Deviation Detection
Can be performed using statistics and visualization techniques or as a by-product of data mining.
Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
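
One statistical way to flag deviations is to measure how many standard deviations each value lies from the mean. The slides do not name a specific statistic, so the z-score test, the limit of 2.0, and the transaction amounts below are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, z_limit=2.0):
    """Return the values whose z-score magnitude exceeds z_limit."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_limit]

# Hypothetical card transaction amounts; 950 is the planted deviation.
amounts = [20, 25, 22, 19, 24, 21, 23, 950]

suspicious = find_outliers(amounts)
print(suspicious)
```

A flagged value is only a candidate for investigation; whether it is fraud, a defect, or a legitimate exception still has to be checked against the source data.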

Example of Database Segmentation using a Visualization

Data Mining Tools
There is a growing number of commercial data mining tools in the marketplace.
Important characteristics of data mining tools include:
Data preparation facilities.
Selection of data mining operations.
Product scalability and performance.
Facilities for visualization of results.

Data Mining and Data Warehousing
Major challenge in exploiting data mining is identifying suitable data to mine.
Data mining requires a single, separate, clean, integrated, and self-consistent source of data.

Data Mining and Data Warehousing
A data warehouse is well equipped for providing data for mining.
Data quality and consistency are prerequisites for mining to ensure the accuracy of the predictive models.
Data warehouses are populated with clean, consistent data.

Data Mining and Data Warehousing
Advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources.
Selecting relevant subsets of records and fields for data mining requires the query capabilities of the data warehouse.

Data Mining and Data Warehousing
Results of a data mining study are useful if there is some way to further investigate the uncovered patterns.
Data warehouses provide the capability to go back to the data source.
