Data Warehouse
Fundamentals
Chapter 9
Data Mining Basics
Instructor: Paul Chen
Topics
1. How Data Mining Evolved?
2. Decision Processing Overview and Tasks
3. Data Mining: What Is It?
4. Data Mining vs. Data Warehousing
5. How Data Mining Works? And Its Applications
6. Data Mining Operations and Associated Techniques
7. The Data Mining Process
8. Data Mining Tools
9. Data Mining Techniques- A Summary
Topic 1: How Data Mining Evolved?
Many businesses have invested heavily in information
technology to help them manage their businesses more
effectively and gain a competitive edge. Increasingly large
amounts of critical business data are being stored
electronically and this volume is expected to continue to
grow. Data Mining technology helps companies
leverage their existing data more effectively and obtain
insightful information that gives them a competitive edge.
How Data Mining Evolved?
Time Line
1960s: Data Collection
1970s-80s: RDBMS
1990s: OLAP and DW
Late 1990s to Now: Data Mining
Topic 2: Decision Processing
Overview
Decision processing systems, and their underlying
analytical applications, provide business users with the
information they need to track and analyze business
trends, and to explore new business opportunities. As
businesses become increasingly competitive and
complex, effective decision processing systems are
essential for success.
The Next Generation of Business
Intelligence
A decision processing system analyzes business
information captured from operational systems (back- and
front-office applications and e-business applications).
Distribution of business information to business users
is via corporate intranets and extranets.
The flow of data can be thought of as an information
supply chain whose objective is to convert operational
data into useful business information.
The Decision Processing Information Supply Chain
[Figure labels: Operational Systems (E-Business Applications,
Back-Office Transaction Applications, Front-Office Applications);
External Data; Information Staging Area; DW & Business Intelligence
Tools; Analytic Applications; Collaborative Office Systems;
Business Metrics; Business Decisions]
Decision Processing: Four Tasks***
Extracting and transforming information
This involves capturing data from operational systems,
transforming it into business information, and loading
into a data warehouse information store.
Current extract templates on the market are aimed primarily at
capturing data from ERP (Enterprise Resource Planning)
transaction processing systems (for example, SAP Business
Information Warehouse and PeopleSoft BPM data warehouse).
*** Mentioned in chapter 2
Decision Processing: Four Tasks
(Cont'd)
Managing information
This task encompasses the maintenance of business
information in information stores, and how these
information stores are processed by business intelligence
tools and analytic applications.
The cornerstone of decision processing is data
warehousing, and warehouse information stores should
be organized and modeled into relational and
multidimensional database products.
Decision Processing: Four Tasks
(Cont'd)
Analyzing and modeling information
The traditional approach to decision
processing is to build a data warehouse
and supply business users with a set of
business intelligence tools (query,
reporting, OLAP and data mining, for
example) to process information in data
warehouse information stores.
A better approach is to employ turn-key and
web-based analytic application packages
that are designed to provide
comprehensive analyses for the business
area being researched. Key business
metrics (e.g., revenue dollars per sales rep
per day) are useful.
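As a rough illustration of the metric named above, the arithmetic can be sketched in Python. The function name and the figures are hypothetical and are not taken from the slides.

```python
def revenue_per_rep_per_day(total_revenue: float, sales_reps: int, days: int) -> float:
    """Average revenue dollars generated by one sales rep in one day."""
    if sales_reps <= 0 or days <= 0:
        raise ValueError("sales_reps and days must be positive")
    return total_revenue / (sales_reps * days)

# Hypothetical figures: $90,000 of revenue, 6 reps, 30 selling days
metric = revenue_per_rep_per_day(90_000.0, 6, 30)
print(metric)  # 500.0
```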
Decision Processing: Four Tasks
(Cont'd)
Distributing information
Business intelligence tools and analytic applications distribute
information and the results of analysis operations to business
users via standard graphical and Web interfaces.
To help users uncover and organize this range of business
information, an enterprise information portal (EIP) is required.
An EIP provides a single point of entry to any piece of
business information, no matter where it resides.
The main components of an EIP are information assistant
(Web browser interface) , an information directory and a
subscription facility.
Decision Making Under Risk
Decisions are made under three sets of conditions:
Certainty
The decision makers know everything in advance
of making the decision
Uncertainty
The decision makers know nothing about the
probabilities or the consequences of decisions
Risk
Decision-Making Style
Decision-making styles of users are categorized as
either
Analytic or
Heuristic
Analytic and Heuristic Decision Making
Analytical Decision Maker            Heuristic Decision Maker
Learns by analyzing                  Learns by acting
Uses step-by-step procedure          Uses trial and error
Values quantitative information      Values experiences
  and models
Builds mathematical models           Relies on common sense
  and algorithms
Seeks optimal solution               Seeks completely satisfying solution
Topic 3: Data Mining: What Is It?
Data Mining has been defined as a decision support
process in which a search is made for patterns of
information in data. To detect patterns in data, Data
Mining uses sophisticated statistical analysis and modeling
technologies to uncover useful relationships hidden in
databases. It predicts future trends and finds behavior
allowing businesses to make predictive, knowledge-driven
decisions.
Data Mining: What Is It?
The process of extracting valid, previously unknown,
comprehensible, and actionable information from large
databases and using it to make crucial business
decisions (Simoudis, 1996).
Involves analysis of data and use of software techniques
for finding hidden and unexpected patterns and
relationships in sets of data.
Data Mining: What Is It?
Reveals information that is hidden and unexpected, as
there is little value in finding patterns and relationships that
are already intuitive.
Patterns and relationships are identified by examining
the underlying rules and features in the data.
Tends to work from the data up, and the most accurate
results normally require large volumes of data to
deliver reliable conclusions.
Data Mining: What Is It?
Starts by developing an optimal representation of the
structure of sample data, during which time knowledge
is acquired and extended to larger sets of data.
Data mining can provide huge paybacks for companies
who have made a significant investment in data
warehousing.
Relatively new technology, however already used in a
number of industries.
Topic 4: Data Mining vs. Data
Warehousing
Data Mining does not require that a Data Warehouse be
built. Often, data can be downloaded from the operational
files to flat files that contain the data ready for the data
mining analysis.
Data Mining can be implemented rapidly on existing
software and hardware platforms. Data Mining tools can
analyze massive databases to deliver answers to questions
such as, "Which customers are most likely to respond to
my next promotional mailing, and why?"
Data Mining vs. Data
Warehousing
A major challenge in exploiting data mining is identifying suitable data
to mine.
Data mining requires a single, separate, clean, integrated, and
self-consistent source of data.
A data warehouse is well equipped for providing data for mining.
Data quality and consistency are prerequisites for mining to
ensure the accuracy of the predictive models. Data warehouses are
populated with clean, consistent data.
Data Mining vs. Data
Warehousing
Advantageous to mine data from multiple sources to discover as
many interrelationships as possible. Data warehouses contain data
from a number of sources.
Selecting relevant subsets of records and fields for data mining
requires query capabilities of the data warehouse.
Results of a data mining study are useful if there is some way to
further investigate the uncovered patterns. Data warehouses
provide capability to go back to the data source.
Topic 5: How Data Mining
Works?
How exactly is Data Mining able to tell you important
things that you didn't know or what is going to happen
next? The technique used in Data Mining is called Predictive
Modeling, which in a broad sense is a knowledge discovery
process based on relationships and patterns.
Modeling is the act of building a model in one situation
where you know the answer and then applying it to another
situation where you don't.
Examples of Applications of Data
Mining via relationships and patterns
Retail / Marketing
Identifying buying patterns of customers
Finding associations among customer demographic
characteristics
Predicting response to mailing campaigns
Market basket analysis
Examples of Applications of Data
Mining via relationships and patterns
Banking
Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their credit
card affiliation
Determining credit card spending by customer
groups
Examples of Applications of Data
Mining via relationships and patterns
Insurance
Claims analysis
Predicting which customers will buy new policies.
Medicine
Characterizing patient behaviour to predict surgery
visits
Identifying successful medical therapies for
different illnesses.
Examples of Applications of Data
Mining via relationships and patterns
Customer profiling: characteristics of good customers are
identified with the goals of predicting who will become
one and helping marketers target new prospects.
Targeting specific marketing promotions to existing and
potential customers offers similar benefits.
Market-basket analysis: With Data Mining, companies can
determine which products to stock in which stores, and
even how to place them within a store.
Examples of Applications of Data
Mining via relationships and patterns
Customer Relationship Management: Data Mining determines the
characteristics of customers who are likely to leave for a
competitor, so a company can take action to retain those
customers. Doing so is usually far less expensive
than acquiring a new customer.
Fraud detection: With Data Mining, companies can
identify potentially fraudulent transactions before they
happen.
Topic 6: Data Mining Operations
and Associated Techniques
As used in the previous foils, predictive modeling in essence includes
the other operations shown in the above table.
Descriptive: The dealer sold 200 cars last month.
   Operational (OLTP)
Explanatory: For every 1% increase in the interest rate,
auto sales decrease by 5%.
   Traditional DW, OLAP
Predictive: Predictions about future buyer behavior.
   Data Mining
Level of Modeling vs. Level of Analytical Processing
Descriptive:
   Simple queries and reports
   Normalized tables
Explanatory:
   What-if processing: analyze what has previously occurred to
   bring about the current state of the data
   Denormalized tables
   Roll-up; drill down
Predictive:
   Determine if any patterns exist by reviewing data relationships
   Denormalized tables + statistical analysis / artificial intelligence
   Classification and value prediction
Predictive Modelling
Similar to the human learning experience in that it
uses observations to form a model of the important
characteristics of some phenomenon.
Uses generalizations of real world and ability to fit
new data into a general framework.
Can analyze a database to determine essential
characteristics (model) about the data set.
Predictive Modelling
Model is developed using a supervised learning
approach, which has two phases: training and testing.
Training builds a model using a large sample of
historical data called a training set.
Testing involves trying out the model on new,
previously unseen data to determine its accuracy
and physical performance characteristics.
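The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the hold-out split, and the one-threshold "model" are hypothetical, and real tools use far richer models.

```python
# Hypothetical historical records: (years_renting, bought_property)
records = [(1, False), (2, False), (3, True), (5, True),
           (1, False), (4, True), (6, True), (2, False)]

def train(training_set):
    """Training: pick the years_renting threshold that best separates the classes."""
    best_threshold, best_correct = None, -1
    for candidate in sorted({years for years, _ in training_set}):
        correct = sum((years >= candidate) == bought
                      for years, bought in training_set)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

def evaluate(threshold, test_set):
    """Testing: fraction of previously unseen records classified correctly."""
    hits = sum((years >= threshold) == bought for years, bought in test_set)
    return hits / len(test_set)

training_set, test_set = records[:5], records[5:]   # simple hold-out split
threshold = train(training_set)
accuracy = evaluate(threshold, test_set)
print(f"threshold={threshold}, accuracy={accuracy:.2f}")  # threshold=3, accuracy=1.00
```

Keeping the test records out of training is the point of the split: accuracy measured on the training set alone would overstate how well the model generalizes.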
Predictive Modelling
Applications of predictive modelling include customer
retention management, credit approval, cross selling,
and direct marketing.
Two techniques associated with predictive modelling:
A. classification
B. value prediction, distinguished by nature of the
variable being predicted.
Statistical Analysis of Actual Sales (dollars
and quantities) Relative to These Signage
Variables: a predictive modeling example.
Signage variables: Content, Frequency, Depth, Focus,
Scale, Length, Location
Statistical analysis: correlation, regression, experiment design,
optimization. Now it goes into real-time analysis.
PREDICTIVE MODELING
There are two techniques associated with predictive
modeling: classification and value prediction, which are
distinguished by the nature of the variable being
predicted.
Predictive Modelling - Classification
Used to establish a specific predetermined class for
each record in a database from a finite set of possible
class values.
Two specializations of classification: tree induction and
neural induction.
Example of Classification using
Tree Induction
Customer renting property > 2 years?
   No:  Rent property
   Yes: Customer age > 45?
      No:  Rent property
      Yes: Buy property
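A tree like this maps directly onto nested conditions. The sketch below assumes the tree reads as follows: customers renting for more than 2 years who are older than 45 are classified as "buy property", and everyone else as "rent property". The function and parameter names are hypothetical.

```python
def classify(years_renting: float, age: int) -> str:
    """Follow the tree from the root to a leaf and return the predicted class."""
    if years_renting > 2:          # root test: renting for more than 2 years?
        if age > 45:               # second test: customer older than 45?
            return "buy property"
        return "rent property"
    return "rent property"

print(classify(3, 50))  # buy property
print(classify(3, 30))  # rent property
print(classify(1, 50))  # rent property
```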
Example of Classification using
Neural Induction
Each processing unit (circle) in one layer is connected
to each processing unit in the next layer by a weighted
value, expressing the strength of the relationship. The
network attempts to mirror the way the human brain
works in recognizing patterns by arithmetically
combining all the variables with a given data point.
In this way, it is possible to develop nonlinear
predictive models that learn by studying
combinations of variables and how different
combinations of variables affect different data sets.
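The weighted connections between layers can be sketched as a forward pass through a tiny network. All weights, biases, and layer sizes below are hypothetical values chosen for illustration; in practice they are learned during training, which this sketch does not show.

```python
import math

def sigmoid(x: float) -> float:
    """Nonlinear activation that squashes any value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each unit sums its weighted inputs, adds a bias, and applies the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
            for unit_weights, b in zip(weights, biases)]

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit
hidden_weights = [[0.5, -0.4], [0.3, 0.8]]
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]

def predict(inputs):
    """Propagate the inputs through the hidden layer to the single output unit."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)[0]

score = predict([1.0, 0.5])
print(round(score, 3))
```

The nonlinearity in each unit is what lets the network model relationships that a straight-line fit cannot.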
Predictive Modelling - Value
Prediction
Used to estimate a continuous numeric value that is
associated with a database record.
Uses the traditional statistical techniques of linear
regression and non-linear regression.
Relatively easy-to-use and understand.
Predictive Modelling - Value
Prediction
Linear regression attempts to fit a straight line through
a plot of the data, such that the line is the best
representation of the average of all observations at that
point in the plot.
Problem is that the technique only works well with
linear data and is sensitive to the presence of outliers
(i.e., data values that do not conform to the expected
norm).
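The sensitivity to outliers is easy to see with an ordinary least-squares fit. The sketch below uses made-up data points; the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
clean_ys = [2, 4, 6, 8, 10]      # exactly linear: y = 2x
outlier_ys = [2, 4, 6, 8, 40]    # same data with one outlier in the last point

clean_slope, clean_intercept = fit_line(xs, clean_ys)
outlier_slope, outlier_intercept = fit_line(xs, outlier_ys)
print(clean_slope, clean_intercept)      # 2.0 0.0
print(outlier_slope, outlier_intercept)  # 8.0 -12.0
```

A single bad value out of five moves the slope from 2 to 8, so the fitted line no longer represents the other four observations.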
Predictive Modelling - Value
Prediction
Although non-linear regression avoids the main
problems of linear regression, it is still not flexible enough
to handle all possible shapes of the data plot.
Statistical measurements are fine for building linear
models that describe predictable data points; however,
most data is not linear in nature.
Predictive Modelling - Value
Prediction
Data mining requires statistical methods that can
accommodate non-linearity, outliers, and non-numeric
data.
Applications of value prediction include credit card
fraud detection or target mailing list identification.
Database Segmentation
Aim is to partition a database into an unknown number
of segments, or clusters, of similar records.
Uses unsupervised learning to discover homogeneous
sub-populations in a database to improve the accuracy
of the profiles.
Database Segmentation
Less precise than other operations thus less sensitive to
redundant and irrelevant features.
Sensitivity can be reduced by ignoring a subset of the
attributes that describe each instance or by assigning a
weighting factor to each variable.
Applications of database segmentation include
customer profiling, direct marketing, and cross selling.
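The idea of ignoring or down-weighting attributes can be sketched with a weighted distance between two records. Weighted Euclidean distance is an assumption made here for illustration; the slides do not name a specific distance measure, and the records are hypothetical.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance; a weight of 0 ignores that attribute."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical records: (age, income in thousands, customer id)
r1 = (30, 50, 1001)
r2 = (33, 54, 9999)

all_attributes = weighted_distance(r1, r2, (1, 1, 1))
without_id = weighted_distance(r1, r2, (1, 1, 0))   # ignore the irrelevant id
print(without_id)  # 5.0
```

With the irrelevant customer id included, the two records look far apart even though their age and income are similar; setting its weight to 0 removes that distortion.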
Example of Database Segmentation
using a Scatter plot
Database Segmentation
Associated with demographic or neural clustering
techniques, distinguished by:
Allowable data inputs
Methods used to calculate the distance between
records
Presentation of the resulting segments for analysis.
Example of Database Segmentation
using a Visualization
Link Analysis
Aims to establish links (associations) between records,
or sets of records, in a database.
There are three specializations:
Associations discovery
Sequential pattern discovery
Similar time sequence discovery
Applications include product affinity analysis, direct
marketing, and stock price movement.
Link Analysis - Associations
Discovery
Finds items that imply the presence of other items in
the same event.
Affinities between items are represented by association
rules.
e.g. When a customer rents property for more than 2
years and is more than 25 years old, in 40% of cases,
the customer will buy a property. This association happens
in 35% of all customers who rent properties.
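The two percentages attached to an association rule are commonly called confidence and support. The sketch below computes both under the usual definitions, which is an assumption about how the slide's figures were derived; the customer data is hypothetical and does not reproduce the 40% and 35% figures.

```python
# Hypothetical customers: (rented_over_2_years, older_than_25, bought_property)
customers = [
    (True, True, True), (True, True, False), (True, True, True),
    (True, False, False), (False, True, False), (True, True, False),
    (False, False, False), (True, True, False), (False, True, True),
    (True, True, False),
]

# Customers satisfying the "if" part of the rule
matches_condition = [c for c in customers if c[0] and c[1]]
# Customers satisfying both the "if" part and the "then" part
matches_rule = [c for c in matches_condition if c[2]]

support = len(matches_rule) / len(customers)              # rule holds in all data
confidence = len(matches_rule) / len(matches_condition)   # "then" given "if"
print(f"support={support:.0%}, confidence={confidence:.0%}")  # support=20%, confidence=33%
```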
Link Analysis - Sequential Pattern
Discovery
Finds patterns between events such that the presence of
one set of items is followed by another set of items in a
database of events over a period of time.
e.g. Used to understand long term customer buying
behaviour.
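A minimal sketch of the idea: count, across customers, how often one item is later followed by another. The purchase histories are hypothetical, and counting ordered pairs is a simplification of what commercial tools do.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each ordered by time
histories = [
    ["house", "cooker", "freezer"],
    ["house", "freezer"],
    ["car", "house", "cooker"],
    ["cooker", "house"],
]

def count_sequences(histories):
    """Count the customers for whom item a is followed later by item b."""
    counts = Counter()
    for history in histories:
        # combinations() keeps the original order, so (a, b) means a came first
        counts.update(set(combinations(history, 2)))
    return counts

counts = count_sequences(histories)
print(counts[("house", "cooker")])  # 2
print(counts[("cooker", "house")])  # 1
```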
Link Analysis - Similar Time
Sequence Discovery
Finds links between two sets of data that are time-
dependent, and is based on the degree of similarity
between the patterns that both time series demonstrate.
e.g. Within three months of buying property, new
home owners will purchase goods such as cookers,
freezers, and washing machines.
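One simple way to score the similarity of two time series is the Pearson correlation coefficient. Using it here is an assumption for illustration; the slides do not say which similarity measure the tools use, and both series below are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation: +1 means the two series move together exactly."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical monthly series
home_sales = [10, 12, 15, 11, 18, 20]
appliance_sales = [21, 25, 31, 23, 37, 41]   # tracks home sales closely
unrelated_sales = [30, 5, 22, 40, 9, 17]

similarity = pearson(home_sales, appliance_sales)
print(round(similarity, 3))
print(round(pearson(home_sales, unrelated_sales), 3))
```

To capture a delayed effect such as "within three months", one series could be shifted by the lag before the comparison.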
Deviation Detection
Relatively new operation in terms of commercially
available data mining tools.
Often a source of true discovery because it identifies
outliers, which express deviation from some previously
known expectation and norm.
Deviation Detection
Can be performed using statistics and visualization
techniques or as a by-product of data mining.
Applications include fraud detection in the use of credit
cards and insurance claims, quality control, and defects
tracing.
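A statistical approach to deviation detection can be sketched with z-scores: values far from the mean, measured in standard deviations, are flagged. The cutoff of 2.0 and the claim amounts are hypothetical choices for illustration.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold (an assumed cutoff)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical insurance claim amounts
claims = [120, 130, 125, 118, 122, 127, 121, 900]
outliers = find_deviations(claims)
print(outliers)  # [900]
```

The sketch assumes the values are not all identical; otherwise the standard deviation is zero and the division fails.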
A Summary: Data-Driven
Techniques*
Data Visualization
Decision Trees
Clustering
Factor Analysis
Neural Network
Association Rules
Rule Induction
* Based on Sakhr Youness's book Professional Data Warehousing with SQL Server 7.0 and
OLAP Services
Data Visualization
A pie chart showing the sales of a product by region is
sometimes much more effective than presenting the
data in a text or tabular form.
[Pie chart: sales by region. Northeast 9%, South 11%, North 39%,
West 21%, East 20%]
Decision Tree
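Cluster Analysis
[Figure: customers clustered into three income segments]
First segment (high income > 8,000)
Second segment (8,000 > middle income > 3,000)
Third segment (low income < 3,000)
Attributes shown in the figure: have children; married; own car;
last car is a used one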
Factor Analysis
Unlike cluster analysis, factor analysis builds a model from data.
The technique finds underlying factors, also called latent
variables, and provides models for these factors based on
variables in the data. For example, a software company is considering a
survey to find out the nine most perceived attributes of one of
its products. It might group these attributes into categories
such as service for technical support, availability of training, and
a help system.
Factor analysis is also used for grouping together products based on a
similarity of buying patterns so that vendors may bundle several
products and sell them together at a lower price than the
sum of their individual prices.
Neural Networks
Association Rules
Association models are models that examine the extent to which
values of one field depend on, or are produced by, values of
another field. These models are often referred to as Market Basket
Analysis when they are applied to retail industries to study the
buying patterns of their customers, especially in grocery and
retail stores that issue their own credit cards. Charging against
these cards gives the store the chance to associate the purchases of
customers with their identities, which allows them to study
associations among other things.
Rule Induction
This is a powerful technique that generates a large number of rules
using a set of if-then statements in the pursuit of all possible
patterns in the dataset. For example, if the customer is male,
is between 30 and 40 years of age, and has an income of more than
$20,000 and less than $50,000, then he is likely to be driving a car that
was bought new.
A Summary: Theory-Driven
Techniques
Correlations
T-Tests
Analysis of Variance
Linear Regression
Logistic Regression
Discriminant Analysis
Forecasting Methods
Topic 7: The Data Mining Process
Define the problem.
Select the data.
Prepare the data.
Mine the data.
Deploy the model.
Take business action.
Are you ready for Data Mining?
Define the problem
A successful data mining initiative always starts with
a well-defined project. To ensure that the project produces
incremental value, include an assessment of the status quo
solution and a review of technology, organization, and
business processes.
Select the data
This step involves defining your data source (not every
data source and record is required). The data is usually
extracted from the source system to a separate server.
Prepare the data
This step represents up to 80 percent of the total project
effort. For data mining, the data must reside in one flat
table (each record has many columns). In addition to being
the most time consuming, the step is also the most critical.
The resulting models are only as good as the data used to
create them.
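Getting the data into one flat table usually means aggregating detail records and merging them into a single row per entity. The sketch below shows one way to do that in plain Python; the table layouts, column names, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical source tables extracted from the operational system
customers = [
    {"customer_id": 1, "age": 34, "region": "West"},
    {"customer_id": 2, "age": 51, "region": "East"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 45.0},
]

def build_flat_table(customers, orders):
    """Aggregate orders per customer and merge them into one row per customer."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for order in orders:
        summary = totals[order["customer_id"]]
        summary["order_count"] += 1
        summary["total_amount"] += order["amount"]
    # A customer with no orders still gets a row, with zero counts
    return [{**customer, **totals[customer["customer_id"]]}
            for customer in customers]

flat_table = build_flat_table(customers, orders)
for row in flat_table:
    print(row)
```

Each output row carries many columns describing one customer, which is the shape most mining algorithms expect as input.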
Mine the data
Typically the easiest and shortest phase, this step involves
applying statistical and AI tools to create mathematical
models. Data mining typically occurs on a server separate
from the data warehousing and other corporate systems.
Deploy the Model
Model deployment is the process of implementing the
mathematical models into operational systems to improve
business results.
Take Business Action
Use the deployed model to achieve improved results to the
business problem identified at the beginning of the
process.
Steps to Implement Data Mining
[Figure elements: Prior Knowledge; Discovery (patterns, relations,
associations, etc.); Information Model; Validation; Deployment]
ARE YOU READY FOR DATA
MINING?
Just because you have a data warehouse doesn't mean
you're necessarily ready for data mining. Much of the
work our company does in the data mining arena has
more to do with data mining readiness assessment than
with actually performing data mining.
Metrics you can use to gauge your data
mining readiness
Do you have a staff of experienced knowledge workers?
Do you have the data?
Do you have marketing processes in place that can use this
data?
Do you have a business champion who can embrace the
process and results?
Do you have the technology infrastructure to support
advanced analysis?
Topic 8: Data Mining Tools
Data mining tools are typically classified by the type of
algorithm they use to identify hidden patterns. There are
many different algorithms in use, but the four most
popular are association, sequence, clustering (or
segmentation), and predictive modeling.
Data Mining Tools
There is a growing number of commercial data
mining tools in the marketplace.
Important characteristics of data mining tools include:
Data preparation facilities
Selection of data mining operations
Product scalability and performance
Facilities for visualization of results.
Data Mining vs. OLAP
They are two separate breeds of analysis with
entirely different objectives, not to mention
tools, skill sets, and implementation methods.
Data Mining
With canned reports, ad hoc querying, and
OLAP, the end user defines a hypothesis and
determines which data to examine. With data
mining, the tool identifies the hypothesis, and it
actually tells the user where in the data to start
the exploration process.
Data Mining
Rather than using SQL to filter out values and methodically
reduce the data into a concise answer set, data mining uses
algorithms that exhaustively review the relationships among
data elements to determine if any patterns exist. The whole
purpose of data mining is to yield new business information
that a business person can act on.
OLAP vs. Data Mining Tools
OLAP Tools:
   Are ad hoc, shrink-wrapped tools that provide an interface to data
   Are used when you have specific known questions
   Look and feel like a spreadsheet that allows rotation, slicing,
   and graphics
   Can be deployed to a large number of users
Data Mining Tools:
   Are methods for analyzing multiple data types
   -- Regression trees
   -- Neural networks
   -- Genetic algorithms
   Are used when you don't know what the questions are
   Are usually textual in nature
   Are usually deployed to a small number of analysts
Data Mining Tools
ASSOCIATION
Association, also frequently referred to as "affinity
analysis," reviews numerous sets of items and looks for
common groupings. An example of association is market
basket analysis, which involves reviewing the products
that consumers purchase in a single trip to the grocery
store.
ASSOCIATION
Finds items that imply the presence of other items
in the same event.
Affinities between items are represented by
association rules.
e.g. When a customer rents property for more than 2
years and is more than 25 years old, in 40% of cases,
the customer will buy a property. This association
happens in 35% of all customers who rent properties.
Data Mining Tools
SEQUENCE
Sequential analysis helps data miners identify a set of
order-specific items or events. Association identifies the
existence of patterns or groups of items; sequential
analysis identifies the order of those patterns or groups of
items.
SEQUENCE
Finds patterns between events such that the presence of
one set of items is followed by another set of items in a
database of events over a period of time.
e.g. Used to understand long term customer buying
behavior.
Link Analysis - Similar Time Sequence
Discovery
Finds links between two sets of data that are time-
dependent, and is based on the degree of similarity
between the patterns that both time series demonstrate.
e.g. Within three months of buying property, new home
owners will purchase goods such as cookers, freezers, and
washing machines.
Data Mining Tools
CLUSTERING
Cluster analysis lets the data miner assemble data into
unforeseen groups containing similar characteristics. Also
known as "segmentation," this type of data
mining is probably the most widely used.
CLUSTERING
Aim is to partition a database into an unknown number of
segments, or clusters, of similar records.
Uses unsupervised learning to discover homogeneous sub-
populations in a database to improve the accuracy of the
profiles.
Data Mining Tools
PREDICTIVE MODELING
As the name implies, predictive modeling involves
developing a model from historical data for predicting a
future event. The power of predictive modeling engines is
that they can use a broad range of data attributes to identify
future behavior. Both cluster analysis and predictive
modeling tools identify distinct groups of items with
common attributes; the difference is that predictive modeling
focuses on the likelihood of a particular outcome for a
particular group.
Topic 9: Data Mining Techniques- A
Summary
Artificial neural networks: Non-linear predictive models that
learn through training and resemble biological neural networks
in structure.
Decision Trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of a
database.
Genetic algorithms: Optimization techniques that use processes
such as genetic combination, mutation, and natural selection in a
design based on the concepts of evolution.
Rule induction: The extraction of useful if-then rules from data
based on statistical significance.
Data Mining Techniques- A
Summary
Predictive modeling:     Classification
                         Value prediction
Database segmentation:   Demographic clustering
                         Neural clustering
Link analysis:           Association discovery
                         Sequential pattern discovery
                         Similar time sequence discovery
Deviation detection:     Statistics
                         Visualization
Two Types of Data Mining Modeling-
Verification and Discovery
The verification model utilizes a process that looks in a
database to detect trends and patterns in data that will help
answer some specific questions about the business.
In this mode, the user generates a hypothesis about the
data, issues a query against the data and examines the
results of the query looking for verification of the
hypothesis or the user decides that the hypothesis is not
valid.
Verification Model
In this model, very little information is created in the
extraction process: either the hypothesis is verified or it is
not.
Common tools used in this mode are queries,
multidimensional analysis, and visualization. What they all have
in common is that the user is essentially guiding the
exploration of the data being inspected.
Discovery Model
A more popular model is the Discovery Model that utilizes
a process that looks in a database to discover and/or
predict future patterns. The discovery model is divided
into two modes: Descriptive and Predictive.
Discovery Model- Descriptive Mode
The Descriptive mode finds hidden patterns without a
predetermined idea or hypothesis about what the patterns
may be. In other words, the Data Mining software or
program takes the initiative in finding what the interesting
patterns are, without the user thinking of the relevant
questions first. In this mode information is created about
the data with very little or no guidance from the user. The
exploration of the data is done in such a way as to yield as
many useful facts about the data as possible in the shortest
amount of time.
Discovery Model- Predictive Mode
In the Predictive mode patterns discovered from the database are used
to predict the future patterns or trends. Predictive modeling allows the
user to submit records with some unknown field values, and the
system will guess the unknown values based on previous patterns
discovered from the database.
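Guessing an unknown field value can be sketched with a nearest-neighbor lookup: the incomplete record borrows the value from the most similar complete record. Nearest neighbor is chosen here only because it is short to show; the slides do not name a method, and the field names and records are hypothetical.

```python
def guess_missing(record, known_records, target):
    """Copy the target value from the most similar known record (1-nearest neighbor)."""
    features = [key for key in record if key != target]

    def distance(other):
        # Squared Euclidean distance over the known numeric fields
        return sum((record[key] - other[key]) ** 2 for key in features)

    nearest = min(known_records, key=distance)
    return nearest[target]

# Hypothetical records with known outcomes
known = [
    {"age": 25, "years_renting": 1, "buys": "no"},
    {"age": 50, "years_renting": 4, "buys": "yes"},
    {"age": 47, "years_renting": 3, "buys": "yes"},
]
incomplete = {"age": 48, "years_renting": 3, "buys": None}

guess = guess_missing(incomplete, known, "buys")
print(guess)  # yes
```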
In comparing the two models, one can state that Verification can be
very inefficient, time-consuming, and costly, whereas Discovery modeling
can be very efficient and cost effective, is less dependent on user input,
and increases modeling accuracy.