Web Mining: Faculty of Information Technology Department of Software Engineering and Information Systems
PART 1
Outline
Introduction
Motivation: Why data mining?
What is data mining?
Business Applications of data mining
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
Motivation:
“Necessity is the Mother of Invention”
Data management
data storage and retrieval
database transaction processing
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining with a variety of applications
Web technology and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology,
information harvesting, business intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs
Why Data Mining?—Potential Applications
Market Analysis and Management
Customer profiling
What types of customers buy what products (clustering or
classification)
Corporate Analysis & Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and apply a class-based pricing
procedure
set pricing strategy in a highly competitive market
Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier
analysis, based on historical data
Applications: Health care, retail, credit card service,
telecomm.
Auto insurance: detect a group of people who stage accidents
to collect insurance
Money laundering: suspicious monetary transactions
Medical insurance
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Fraud Detection & Mining Unusual Patterns
Detecting inappropriate medical treatment
The Australian Health Insurance Commission identified that in many
cases blanket screening tests were requested (saving A$1 million
per year)
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
British Telecom identified discrete groups of callers with frequent
intra-group calls, especially mobile phones, and broke a
multimillion dollar fraud.
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest
employees
Anti-terrorism
Financial Data Analysis
Financial data
complete
reliable
high quality
Loan payment prediction and customer credit policy analysis
Factors influencing loan payment performance
loan-to-value ratio
term of the loan
debt ratio (total monthly debt/total monthly income)
payment-to-income ratio
income level
education level
residence region
credit history
Data Mining for the Retail Industry
Multidimensional analysis of sales, customers,
products, time and region
OLAP cubes
Effectiveness of sales campaigns
Advertisements, coupons, discounts, bonuses
promote products and attract customers
can help improve profits
Compare amount of sales and number of transactions
during the sales period versus before or after the sales campaign
Association analysis
which items are likely to be purchased together with the items
on sale
Data Mining for the Retail Industry
Customer retention: analysis of customer loyalty
sequences of purchases of particular customers
goods purchased at different periods by the same customers
can be grouped into sequences
changes in customer consumption or loyalty
suggests adjustments on the pricing and variety of goods
to retain old customers and attract new customers
Purchase recommendation and cross-reference of
items
associations from sales records
a customer who buys a PC is likely to buy a printer
purchase recommendations
Data Mining for the Telecommunication Industry
Telecommunication data are multidimensional
calling time, duration
location of caller, location of callee
type of call
used to identify and compare
data traffic, system workload
resource usage, user group behaviour
profit
fraudulent pattern analysis and identification of
unusual patterns
to achieve customer loyalty
characteristics of customers affecting line usage
Other Applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain competitive advantage for
New York Knicks and Miami Heat
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer
preference and behavior pages, analyzing effectiveness of
Web marketing, improving Web site organization, etc.
Data Mining: A KDD Process
Data mining: core of the knowledge discovery process
[Figure: KDD process, bottom to top: Databases; Data Cleaning and Data Integration; Task-relevant Data; Data Mining; Pattern Evaluation]
Steps of a KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: business intelligence pyramid; the potential to support business decisions increases toward the top. Labels recoverable from the slide: Making Decisions (End User); Data Exploration (Statistical Analysis, Querying and Reporting); Pattern evaluation; Data Warehouse; Databases]
Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Object-relational database
Time-series data
Multimedia database
Data Mining Functionalities
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Data Mining Functionalities
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
Noise or exception? No! useful in fraud detection, rare events
analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Similarity-based analysis
Are All the “Discovered” Patterns Interesting?
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Two Styles of Data Mining
Descriptive data mining
Characterize the general properties of the data in
the database
finds patterns in data and
user determines which ones are important
Predictive data mining
perform inference on the current data to make
predictions
we know what to predict
Not mutually exclusive
used together
Descriptive Data Mining
Discovering new patterns inside the data
Used during the data exploration steps
what is in the data
what does it look like
are there any unusual patterns
what does the data suggest for customer
segmentation
users may have no idea
which kind of patterns may be interesting
Descriptive Data Mining
Patterns at various granularities
geography
country - city - region - street
student
university - faculty - department - minor
A Model is a Black Box
X: vector of independent variables
Y = f(X): an unknown function
Predictive Data Mining
Using known examples the model is trained
the unknown function is learned from data
Used to predict outcomes whose inputs are known but the output
values are not realized yet
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, Web mining, etc.
OLAP Mining: Integration of Data Mining and Data Warehousing
An OLAM Architecture
[Figure: layered OLAM architecture. Layer 4, User Interface: mining query in, mining result out, User GUI API. Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine. Layer 2, MDDB: MDDB and Meta Data.]
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
An Example Problem
All Electronics (AE) is a multi-branch retail company
relational tables include
customer
ID, name, address, age, income, education, sex, m_status
items
ID, name, brand, category, type, price, place_made, supplier, cost
employee
ID, name, department, education, salary
branch
purchases
transID, item_sold, customer_ID, emp_ID, date, time, method_paid,
amount
Concept Description
Characterization
Discrimination
Data classes or concepts
Classes of items for sale
computers, printers
Concepts of customers:
BigSpenders
BudgetSpenders
Data Characterization
Summarizing the data of the class under
study (target class)
Methods
OLAP roll-up operation
user-controlled data summarization
along a specified dimension
attribute oriented induction
without step by step user interaction
The output of characterization
pie charts, bar charts, curves, multidimensional data
cubes, or cross tabs
in rule form as characteristic rules
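As a rough illustration of the roll-up operation listed above, the sketch below sums sales amounts along a location hierarchy. The records, field layout, and the `roll_up` helper are invented for illustration; they are not taken from the All Electronics example.

```python
from collections import defaultdict

# Hypothetical sales records: (country, city, street, amount).
sales = [
    ("A", "City1", "Street1", 120.0),
    ("A", "City1", "Street2", 80.0),
    ("A", "City2", "Street3", 50.0),
]

def roll_up(records, level):
    """Sum amounts after truncating the location key to `level` fields.

    level=3 keeps (country, city, street); level=2 rolls up to city;
    level=1 rolls up to country.
    """
    totals = defaultdict(float)
    for *location, amount in records:
        totals[tuple(location[:level])] += amount
    return dict(totals)

by_city = roll_up(sales, 2)      # summarized at city level
by_country = roll_up(sales, 1)   # summarized at country level
```

Drilling down is the reverse direction: calling `roll_up` with a larger `level` exposes the finer-grained totals again.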
Characterization Example
Description summarizing the characteristics of
customers who spend more than $1000 a year
at All Electronics
age, employment, income
drill down on any dimension
e.g., drill down on occupation to view these customers according to their type of
employment
Data Discrimination
Comparing the target class with one or a set
of comparative classes (contrasting classes)
these classes can be specified by the user
and retrieved through database queries
Methods and output
similar to those used for characterization
include comparative measures to distinguish
between the target and contrasting classes
Discrimination Examples
Compare the general features of software products
whose sales increased by 10% in the last year
whose sales decreased by at least 30% during the same period
Compare two groups of AE customers
I) who shop for computer products regularly
more than two times a month
II) who rarely shop for such products
less than three times a year
The resulting description:
80% of group I customers
university education
ages 20-40
60% of group II customers
seniors or young
no university degree
Multidimensional Data
Sales according to region, month and product type
Dimensions: Product, Location, Time
Hierarchical summarization paths
[Figure: data cube with hierarchical summarization paths; recoverable labels: Office, Day, Month]
Association Analysis
Discovery of association rules showing
attribute-value conditions that occur frequently
together in a given set of data
Widely used
market basket
transaction data analysis
More formally
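$X \Rightarrow Y$, that is,
$A_1 \wedge A_2 \wedge \dots \wedge A_k \Rightarrow B_1 \wedge B_2 \wedge \dots \wedge B_l$
where each $A_i$ and $B_j$ is an attribute-value pair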
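Rules of this form are usually reported with two measures, support and confidence. As a reminder of the standard definitions (with $P$ denoting relative frequency in the data set, and $X \cup Y$ meaning that all conditions of both $X$ and $Y$ hold):

```latex
\mathrm{support}(X \Rightarrow Y) = P(X \cup Y), \qquad
\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X)
  = \frac{P(X \cup Y)}{P(X)}
```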
Example: Association Analysis
From the All Electronics (AE) database
$\mathrm{age}(X, \text{"20..29"}) \wedge \mathrm{income}(X, \text{"20K..40K"}) \Rightarrow \mathrm{buy}(X, \text{"CD player"})$
(support = 2%, confidence = 60%)
X is a variable representing a customer
2% of the AE customers
are between 20 and 29 years of age,
have incomes ranging from 20K to 40K,
and have bought a CD player
There is a 60% probability that a customer in this age and
income group will buy a CD player
A multidimensional association rule
contains more than one attribute or predicate
Market Basket Analysis
Customers' buying behaviour is
investigated
Based only on the transaction data
no information about customer properties:
age, income
Managers
are interested in which products or product
groups are sold together
Example: Basket Analysis Rule
$\mathrm{buy}(\text{computer}) \Rightarrow \mathrm{buy}(\text{printer})$
(support = 1%, confidence = 60%)
1% of all transactions contain
both a computer and a printer
if a transaction contains a computer
there is a 60% chance that it contains a printer as well
a single dimensional association rule
contains a single predicate
an association rule is interesting if
its support exceeds a minimum threshold and
its confidence exceeds a min threshold
These min values are set by specialists
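To make the two measures concrete, here is a minimal sketch that computes support and confidence for a single rule and applies the threshold test described above. The baskets and the helper names are hypothetical, and the thresholds passed to `is_interesting` stand in for the values a specialist would choose.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

def is_interesting(s, c, min_support, min_confidence):
    """True when both measures meet or exceed their minimum thresholds."""
    return s >= min_support and c >= min_confidence

# Hypothetical baskets, invented for illustration.
baskets = [
    {"computer", "printer"},
    {"computer"},
    {"printer", "paper"},
    {"computer", "printer", "paper"},
    {"paper"},
]

s = support(baskets, {"computer", "printer"})        # 2 of 5 baskets
c = confidence(baskets, {"computer"}, {"printer"})   # 2 of 3 computer baskets
```

Note that `confidence` divides by the support of the antecedent, so it assumes the antecedent occurs at least once in the data.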
Classification and Prediction
Finding models (functions) that describe and distinguish classes
or concepts for future prediction
The derived model is based on the analysis of a set of training
data (objects whose class labels are known)
E.g., classify countries based on climate, or classify cars based on
gas mileage
Presentation: decision-tree, classification rule, neural network
Prediction: Predict some unknown or missing numerical values
May need to be preceded by relevance analysis which attempts to
identify attributes that do not contribute to the classification or
prediction process
These attributes can be excluded
Steps of Classification Process
Train the model
using a training set
Test the model
on a test sample
whose class labels are known but not used
for training the model
Use the model for classification
on new data whose class labels are
unknown
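The three steps above can be sketched end to end with a very small classifier. The example uses a 1-nearest-neighbour model purely because it is short; the slides do not prescribe it, and the (income, wealth) examples and labels are invented for illustration.

```python
import math

def train(examples):
    """'Training' a 1-nearest-neighbour model just stores the labelled examples."""
    return list(examples)

def predict(model, x):
    """Return the label of the stored example closest to x."""
    _, label = min(model, key=lambda ex: math.dist(ex[0], x))
    return label

def accuracy(model, labelled):
    """Fraction of labelled examples the model classifies correctly."""
    correct = sum(1 for x, y in labelled if predict(model, x) == y)
    return correct / len(labelled)

# Hypothetical (income, wealth) examples with known class labels.
training_set = [((20, 5), "DEFAULT"), ((25, 8), "DEFAULT"),
                ((60, 40), "OK"), ((70, 55), "OK")]
test_set = [((22, 6), "DEFAULT"), ((65, 50), "OK")]

model = train(training_set)                 # 1. train on the training set
test_accuracy = accuracy(model, test_set)   # 2. test on held-out labelled data
new_label = predict(model, (68, 45))        # 3. classify new, unlabelled data
```

The test set is kept separate from the training set so that the measured accuracy reflects behaviour on examples the model has not seen.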
Example
[Figure: scatter plot of OK and DEFAULT cases; axes: yearly income and wealth]
Decision Trees
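Solution
[Figure: the same plot split into OK and DEFAULT regions by a threshold $\theta_1$ on $x_1$ (yearly income) and a threshold $\theta_2$ on $x_2$ (wealth)]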
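The solution slide appears to split the plane with one threshold on each variable. A minimal sketch of such a rule follows, assuming the OK region is the one where both values exceed their thresholds; the direction of the rule, the threshold values, and the function name are assumptions, not taken from the slide.

```python
def classify(income, wealth, income_threshold, wealth_threshold):
    """Two-level decision rule: test x1 (yearly income), then x2 (wealth)."""
    if income > income_threshold:
        if wealth > wealth_threshold:
            return "OK"
    return "DEFAULT"

# Hypothetical thresholds and applicants.
THETA_1, THETA_2 = 30_000, 10_000
a = classify(45_000, 25_000, THETA_1, THETA_2)  # both tests pass
b = classify(45_000, 5_000, THETA_1, THETA_2)   # wealth test fails
```

Written as nested tests, the rule has the shape of a two-level decision tree with one split per variable.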
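Artificial Neural Nets: Perceptron
[Figure: perceptron with inputs $x_0 = +1, x_1, x_2, \dots, x_d$, weights $w_0, w_1, w_2, \dots, w_d$, and output $y$]
$$y = g(x_1 w_1 + x_2 w_2 + w_0)$$
$$y = g(\mathbf{w}^T \mathbf{x})$$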
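A minimal sketch of the perceptron's forward computation, the weighted sum of the inputs passed through $g$, is given below. The slide leaves $g$ unspecified, so the sigmoid used here is an assumed choice, and the weights and inputs are hypothetical.

```python
import math

def sigmoid(a):
    """One common choice for the activation g (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-a))

def perceptron(w, x, g=sigmoid):
    """Compute y = g(w^T x), where x[0] must be the bias input +1."""
    return g(sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical weights [w0, w1, w2] and input [x0 = +1, x1, x2].
w = [-1.0, 2.0, 0.5]
x = [1.0, 0.5, 0.0]
y = perceptron(w, x)   # w^T x = -1 + 1 + 0 = 0, so y = sigmoid(0)
```

Treating the bias $w_0$ as the weight of a constant input $x_0 = +1$ lets the whole computation be written as a single dot product.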
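Training ANNs
$$o = g(\mathbf{w}^T \mathbf{x}) = g\left(\sum_{i=0}^{d} w_i x_i\right)$$
Learning set: $X = \{\mathbf{x}^t, y^t\}_t$
Find $\mathbf{w}$ which minimizes the error on $X$
$$E(\mathbf{w} \mid X) = \sum_{t \in X} \left(y^t - o^t\right)^2 = \sum_{t \in X} \left(y^t - g\left(\sum_{i} w_i x_i^t\right)\right)^2$$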
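One way to search for a weight vector that lowers the squared error $E(\mathbf{w} \mid X)$ is batch gradient descent. The sketch below assumes a sigmoid for $g$; the learning rate, epoch count, and the small learning set (logical OR) are hypothetical choices, and the slides do not say which optimizer is intended.

```python
import math

def g(a):
    """Sigmoid activation (an assumed choice of g)."""
    return 1.0 / (1.0 + math.exp(-a))

def output(w, x):
    """o = g(w^T x); x[0] is the bias input +1."""
    return g(sum(wi * xi for wi, xi in zip(w, x)))

def error(w, data):
    """E(w | X): sum over examples of (y - o)^2."""
    return sum((y - output(w, x)) ** 2 for x, y in data)

def train(data, d, rate=0.5, epochs=2000):
    """Batch gradient descent on E; w has d + 1 entries (w[0] is the bias)."""
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, y in data:
            o = output(w, x)
            # dE/dw_i = -2 (y - o) g'(a) x_i, with g'(a) = o (1 - o) for the sigmoid
            for i, xi in enumerate(x):
                grad[i] += -2.0 * (y - o) * o * (1.0 - o) * xi
        w = [wi - rate * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical learning set: logical OR, each x starts with the bias input +1.
X = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train(X, d=2)
```

After training, `error(w, X)` should be well below its starting value of 1.0 (every output is 0.5 when all weights are zero).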
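ANN for Classification
[Figure: network with inputs $x_0 = +1, x_1, x_2, \dots, x_d$, outputs $o_1, o_2, \dots, o_K$, and weights $w_{ji}$ (for example $w_{Kd}$)]
$$o_j^t = g(\mathbf{w}_j^T \mathbf{x}^t) = g\left(\sum_{i=0}^{d} w_{ji} x_i^t\right)$$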
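With $K$ output units, each class has its own weight vector $\mathbf{w}_j$. The sketch below computes all $K$ outputs and, as a commonly used convention that is assumed here rather than stated on the slide, picks the class with the largest output. The weight matrix is hypothetical.

```python
import math

def g(a):
    """Sigmoid activation (an assumed choice of g)."""
    return 1.0 / (1.0 + math.exp(-a))

def outputs(W, x):
    """o_j = g(w_j^T x) for each of the K weight rows in W."""
    return [g(sum(wji * xi for wji, xi in zip(wj, x))) for wj in W]

def classify(W, x):
    """Index of the output unit with the largest value."""
    o = outputs(W, x)
    return max(range(len(o)), key=o.__getitem__)

# Hypothetical K = 3 classes, d = 2 inputs plus the bias input x0 = +1.
W = [[0.0, 1.0, -1.0],
     [0.0, -1.0, 1.0],
     [0.5, 0.0, 0.0]]
k = classify(W, [1.0, 2.0, 0.0])   # activations are 2, -2 and 0.5
```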
Prediction Methods
linear regression
$Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + \dots + a_k X_{k,i} + u_i$
non-linear regression
$Y_i = f(X_{1,i}, X_{2,i}, \dots, X_{k,i};\ a_1, a_2, \dots, a_k,\ u_i)$
generalized linear regression
logistic
logit, probit
when the dependent variable is categorical
good customer vs. bad customer, or employed vs. unemployed
Poisson regression
for count variables
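For the linear case with a single predictor, the coefficients $a_0$ and $a_1$ can be estimated by ordinary least squares. The sketch below uses the closed-form solution; the data points and the `fit_line` name are hypothetical, and the multi-predictor and generalized models listed above would need a numerical library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a0 + a1 * X with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1

# Hypothetical data lying exactly on Y = 1 + 2X.
a0, a1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
prediction = a0 + a1 * 4   # predicted Y for a new X = 4
```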
Example: Prediction and Classification
Classification is used to classify customers applying for
credit cards
known class labels: risky, reliable
when a new customer applies, his/her characteristics
are examined
income, age, education, wealth, region, ...
and the customer's class is predicted
Cluster Analysis
Class label is unknown: Group data to form new
classes, e.g., cluster houses to find distribution
patterns
Clustering based on the principle: maximizing the
intra-class similarity and minimizing the interclass
similarity
Objects within a cluster have high similarity in comparison to
one another
but are very dissimilar to objects in other clusters
There may be a hierarchy of classes
Example: Clustering
Can be performed on AE customer data
to identify homogeneous subpopulations of
customers
represent individual target groups for
marketing
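Example
[Figure: scatter plot of customers with axes income and distance, showing three groups (Type 1, Type 2, Type 3), each with its center marked by +]
Clustering according to income and distance to store
three clusters of data points are evident
+ signs indicate group centers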
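Group centers like those marked in the figure can be found with k-means (Lloyd's algorithm). The slides do not name a specific clustering method, so this is one illustrative choice; the points and starting centers are invented.

```python
import math

def k_means(points, centers, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # New center = coordinate-wise mean; an empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical (income, distance to store) points forming three visible groups.
points = [(10, 1), (12, 2), (11, 1.5),
          (50, 9), (52, 10), (51, 9.5),
          (90, 3), (92, 4), (91, 3.5)]
centers, clusters = k_means(points, centers=[(0, 0), (50, 0), (100, 0)])
```

The result depends on the starting centers; with poorly chosen starts, k-means can settle on a worse grouping, which is why it is often run several times.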
Outlier Analysis
Outlier: a data object that does not comply with the
general behavior of the data
It can be considered as noise or exception but is quite
useful in fraud detection, rare events analysis
Detected using
statistical tests
distance measures
visually inspecting the data
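As one simple instance of a statistical test for outliers, the sketch below flags values that lie far from the mean in units of the sample standard deviation. The threshold of 2 and the data are hypothetical choices, not values from the slides.

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold * s]

# Hypothetical ages; 999 mimics a data-entry (coding) error.
ages = [23, 31, 27, 45, 38, 29, 34, 41, 999, 36]
suspects = z_score_outliers(ages)
```

On small samples a single extreme value inflates the standard deviation itself, so a stricter threshold such as 3 could miss it; median-based measures are a common alternative.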
Reasons for Outliers
Measurement errors
Coding errors
age is entered as 999
Nature of data
the salary of the general manager is much higher
than that of the other employees
In some countries in crisis, interest rates have been
on the order of 1000s
Evolution Analysis
Describes and models regularities or trends for objects
whose behavior changes over time
Distinct features include
Trend and deviation: time-series data analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Example
Stock market predictions: future stock prices
For overall stocks: indexes or individual company stocks
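A simple moving average is one basic way to expose the trend in a time series and to measure how far each observation deviates from it. The prices and window size below are hypothetical, and this is a descriptive smoothing step rather than a method for predicting future prices.

```python
def moving_average(series, window):
    """Simple moving average; returns one value per full window."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily closing prices.
prices = [10, 11, 13, 12, 14, 16, 15]
trend = moving_average(prices, window=3)
# Deviation of each price from the trend value of the window ending on that day.
deviation = [p - t for p, t in zip(prices[2:], trend)]
```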