Data Mining and Data Warehousing
B.E. Computer, 8th Semester
Unit 1 : Introduction to Data
Mining and Data Warehousing
What is Data?
• A representation of facts, concepts, or
instructions in a formal manner suitable
for communication, interpretation, or
processing by human beings or by
computers.
Figure: The data hierarchy: Data → Information → Knowledge → Wisdom.
Review of basic concepts of data warehousing and data mining
Solution:
“Necessity is the mother of invention”—Data
Warehousing and Data Mining
What is Data Mining?
Figure: Data mining draws on many disciplines: database systems, statistics, machine learning, visualization, algorithms, and others.
Knowledge Discovery in Databases (KDD)
Process
• Many people treat data mining as a synonym for
another popularly used term, Knowledge
Discovery in Databases, or KDD.
• Alternatively, others view data mining as simply
an essential step in the process of knowledge
discovery.
• Data mining, or knowledge discovery in databases
(KDD) as it is also known, is the nontrivial
extraction of implicit, previously unknown, and
potentially useful information from data.
Some Alternative names to data mining are:
– Knowledge discovery (mining) in databases
(KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology
– Data Dredging
– Information Harvesting
– Business intelligence, etc.
Figure: Data mining as a step in the process of knowledge discovery.
THE PROCESS OF KNOWLEDGE DISCOVERY
Figure: ETL processing. Data from source systems (www data, MIS systems such as accounting and HR, legacy systems, archived data, and other applications written in COBOL, VB, C++, or Java) is extracted, cleansed, and transformed via temporary data storage, then loaded into the data warehouse for OLAP.
ETL consists of independent yet interrelated steps, so it is important to look at the big picture.
Data acquisition time may include:
• Extracts from source systems
• Data Movement
• Data Transformation
• Data Cleansing
• Data Loading
• Statistics Collection
• Index Maintenance
• Backup
Backup is a major task: it is a data warehouse, not a cube.
ETL is often a complex combination of process and
technology that consumes a significant portion of the data
warehouse development efforts and requires the skills of
business analysts, database designers, and application
developers
It is not a one time event as new data is added to the Data
Warehouse periodically – i.e. monthly, daily, hourly
Because ETL is an integral, ongoing, and recurring part of a
data warehouse, it should be:
Automated
Well documented
Easily changeable
When defining ETL for a data warehouse, it is important to
think of ETL as a process, not a physical implementation
Extraction, Transformation, and Loading
(ETL) Processes
Data Extraction
Data Cleansing
Data Transformation
Data Loading
Data Refreshing
Data Extraction
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for
loading into the data warehouse.
Record-level functions:
• Selection: data partitioning
• Joining: data combining
• Aggregation: data summarization
Field-level functions:
• Single-field: from one field to one field
• Multi-field: from many fields to one, or one field to many
Basic Tasks
1. Selection
2. Splitting/Joining
3. Conversion
4. Summarization
5. Enrichment
Data Loading
Most loads involve only change data rather
than a bulk reloading of all of the data in the
warehouse.
Data are physically moved to the data
warehouse
The loading takes place within a “load
window”
The trend is toward near-real-time updates of the
data warehouse, as the warehouse is increasingly
used for operational applications
Load/Index= place transformed data into the warehouse and create indexes
Data Refresh
Propagate updates from sources to the warehouse
Issues:
– when to refresh
– how to refresh -- refresh techniques
Set by administrator depending on user
needs and traffic
When to Refresh?
periodically (e.g., every night, every week) or
after significant events
on every update: not warranted unless
warehouse data require current data (up-to-the-minute
stock quotes)
refresh policy set by administrator based on
user needs and traffic
possibly different policies for different sources
Refresh Techniques
Full Extract from base tables
• read entire source table: too expensive
• maybe the only choice for legacy systems
ETL vs. ELT
ETL: Extract, Transform, Load in which data
transformation takes place on a separate
transformation server.
ELT: Extract, Load, Transform in which data
transformation takes place on the data
warehouse server.
Thank you !!!
Chapter 5 :
Data Warehouse to Data
Mining
Data Warehouse Architecture
Operational Data Sources: These may include:
• Network databases.
• Departmental file systems and RDBMSs.
• Private workstations and servers.
• External systems (Internet, commercially available
databases).
OLAP is FASMI:
• Fast
• Analysis
• Shared
• Multidimensional
• Information
Main characteristics of OLAP are:
• Multidimensional conceptual view
• Multi-user support
• Accessibility
• Storing
• Uniform reporting performance
• Facilitate interactive query and complex analysis
for the users.
• Provides ability to perform intricate calculations
and comparisons.
• Presents results in a number of meaningful ways,
including charts and graphs.
Comparing OLAP and Data Mining
Examples of OLAP Applications in
Various Functional Areas
OLAP Benefits
• Increased productivity of end-users.
• Retention of organizational control over
the integrity of corporate data.
• Reduced query drag and network traffic on
OLTP systems or on the data warehouse.
• Improved potential revenue and
profitability.
Strengths of OLAP
Downsides of ROLAP:
• Slow Response
• Some limitations on scalability
Multi-Dimensional OLAP (MOLAP)
• The first generation of server-based
multidimensional OLAP (MOLAP) solutions use
multidimensional databases (MDDBs).
• The main advantage of an MDDB over an RDBMS
is that an MDDB can provide information quickly
since it is calculated and stored at the
appropriate hierarchy level in advance.
• However, this limits the flexibility of the MDDB
since the dimensions and aggregations are
predefined.
Multi-Dimensional OLAP (MOLAP)
• If a business analyst wants to examine a
dimension that is not defined in the MDDB, a
developer needs to define the dimension in
the database and modify the routines used to
locate and reformat the source data before
an operator can load the dimension data.
• Another important operational consideration
is that the data in the MDDB must be
periodically updated to remain current.
• This update process needs to be scheduled
and managed. In addition, the updates need
to go through a data cleansing and validation
process to ensure data consistency.
• Finally, an administrator needs to allocate
time for creating indexes and aggregations, a
task that can consume considerable time
once the raw data has been loaded.
• These requirements also apply if the
company is building a data warehouse that is
acting as a source for the MDDB.
• Organizations typically need to invest
significant resources in implementing MDDB
systems and monitoring their daily operations.
• This complexity adds to implementation delays
and costs, and requires significant IT
involvement.
• This also results in the analyst, who is typically
a business user, having a greater dependency
on IT.
• Thus, one of the key benefits of this OLAP
technology — the ability to analyze
information without the use of IT professionals
— may be significantly diminished.
Typical Architecture for MOLAP Tools
• Uses specialized data structures and multi-dimensional Database Management Systems (MD-DBMSs) to organize, navigate, and analyze data.
• Uses a specialized DBMS with a model such as the "data cube."
• Data is typically aggregated and stored according to predicted usage to enhance query performance.
Figure: a front-end tool connected to a multidimensional database.
• Traditionally, require a tight coupling with the
application layer and presentation layer.
• Recent trends segregate the OLAP from the data
structures through the use of published application
programming interfaces (APIs).
• MOLAP Products
– Pilot, Arbor Essbase, Gentia
• MOLAP Tools
– ORACLE Express Server
– ORACLE Express Clients (C/S and Web)
– MicroStrategy’s DSS server
– Platinum Technologies’ Platinum InfoBeacon
• Use array technology and efficient storage techniques
that minimize the disk space requirements through
sparse data management.
• Provides excellent performance when data is used as
designed, and the focus is on data for a specific
decision-support application.
• Features:
Very fast response
Ability to quickly write data into the cube
• Downsides:
Limited Scalability
Inability to contain detailed data
Load time
Desktop OLAP (or Client OLAP)
• The desktop OLAP market resulted from the need for
users to run business queries using relatively small data
sets extracted from production systems.
• Most desktop OLAP systems were developed as
extensions of production system report writers, while
others were developed in the early days of client/server
computing to take advantage of the power of the
emerging (at that time) PC desktop.
• Desktop OLAP systems are popular and typically require
relatively little IT investment to implement. They also
provide highly mobile OLAP operations for users who
may work remotely or travel extensively.
• However, most are limited to a single user and lack the
ability to manage large data sets.
Client OLAP
Stores data in the form of cubes/micro-cubes on the desktop/client machine.
Characteristics:
• proprietary data structure on the client
• data stored as a file
• mostly RAM-based architectures
• supports the mobile user
• ease of installation and use
Limitations:
• limited data volume
• no multiuser capabilities
Products: Brio.Enterprise, BusinessObjects, Cognos PowerPlay
Hybrid OLAP (HOLAP)
• Hybrid OLAP (HOLAP) combines ROLAP and
MOLAP storage.
• It tries to take advantage of the strengths of
each of the other two architectures, while
minimizing their weaknesses.
• Some vendors provide the ability to access
relational databases directly from an MDDB,
giving rise to the concept of hybrid OLAP
environments.
• This implements the concept of "drill through,"
which automatically generates SQL to retrieve
detail data records for further analysis.
Hybrid OLAP (HOLAP)
• This gives end users the perception they are
drilling past the multidimensional database into
the source database.
• The hybrid OLAP system combines the
performance and functionality of the MDDB with
the ability to access detail data, which provides
greater value to some categories of users.
• However, these implementations are typically
supported by a single vendor’s databases and are
fairly complex to deploy and maintain.
• Additionally, they are typically somewhat
restrictive in terms of their mobility.
• Can use data from either a RDBMS directly or a
multi-dimension server.
• Equal treatment of MD and Relational Data
• Storage type at the discretion of the administrator
• Cube Partitioning
HOLAP System
Figure: HOLAP system architecture (with a metadata layer).
• HOLAP Products:
–Oracle Express
–Seagate Holos
–Speedware Media/M
–Microsoft OLAP Services
HOLAP Features:
• Summary-type information is kept in the cube (faster response).
• Ability to drill down to relational data sources (drill through detail to underlying data).
• The source of the data is transparent to the end user.
OLAP Products

OLAP Category    Candidate Products    Vendor
ROLAP            Microstrategy         Microstrategy
MOLAP            Essbase               Hyperion
Figure: Quarterly Auto Sales Summary. Sales by region (Northeast, Southeast, Central, Northwest, Southwest), with state-level detail such as Maine, New York, and Massachusetts under Northeast and Florida, Georgia, and Virginia under Southeast.
Example of Roll-up
Figure: a detail table with columns Region, State, Units Sold, and Revenue (states grouped under Northeast and Southeast) is rolled up to region-level totals for Northeast, Southeast, Central, Northwest, and Southwest.
Slice and Dice
The slice operation performs a selection on one
dimension of the given cube, resulting in a sub-cube.
Figure: slicing a cube along the Region and Time dimensions.
Pivot
Example of Rotation (Pivot Table)
Data Mining Tools
Thank you !!!
Unit 6 : Data Mining
Approaches and Methods
Types of Data Mining Models
Predictive Model
(a)Classification -Data is mapped into predefined
groups or classes. Also termed as supervised learning
as classes are established prior to examination of
data.
(b) Regression- Mapping of data item into known
type of functions. These may be linear, logistic
functions etc.
(c) Time Series Analysis- Values of an attribute are
examined at evenly spaced times, as they vary with
time.
(d) Prediction- It means foretelling future data states
based on past and current data.
Types of Data Mining Models
Descriptive Models
(a) Clustering- It is referred to as unsupervised learning or
segmentation/partitioning. In clustering, groups are not
pre-defined.
(b) Summarization- Data is mapped into subsets with
simple descriptions. Also termed characterization or
generalization.
(c) Sequence Discovery- Sequential analysis, or sequence
discovery, is used to find sequential patterns in data.
Similar to association, but the relationship is based on
time.
(d) Association Rules- A model which identifies specific
types of data associations.
Descriptive vs. Predictive Data Mining
Descriptive Mining:
It describes concepts or task-relevant data sets in concise,
summarative, informative, discriminative forms.
Predictive Mining:
It is based on data and analysis, constructs models for the database,
and predicts the trend and properties of unknown data.
Supervised and Unsupervised learning
Supervised learning:
– The network’s answer to each input pattern is
directly compared with the desired answer,
and feedback is given to the network to
correct possible errors
Unsupervised learning:
– The target answer is unknown. The network
groups the input patterns of the training sets
into clusters, based on correlation and
similarities.
Supervised (type and number of classes are known in advance):
• Bayesian Modeling
• Decision Trees
• Neural Networks
Unsupervised (type and number of classes are NOT known in advance):
• One-way Clustering
• Two-way Clustering
Classification and Prediction
Classification and prediction are two forms of data analysis that
can be used to extract models describing important data classes
or to predict future data trends. Such analysis can help provide
us with a better understanding of the data at large.
Linear Regression
As the regression coefficients are also considered as weights, we may write the
equation as:
y = w0 + w1 x
These coefficients are solved by the method of least squares, which estimates
the best-fitting straight line as the one that minimizes the error between the
actual data and the estimate of the line.
Classification is the process of finding a model (or
function) that describes and distinguishes data
classes or concepts, for the purpose of being able
to use the model to predict the class of objects
whose class label is unknown. The derived model is
based on the analysis of a set of training data (i.e.,
data objects whose class label is known).
Figure: Learning. A decision tree expressing the grading rules:
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 50 then grade = F.
Here, the class label attribute is loan decision, and the
learned model or classifier is represented in the form of
classification rules.
Examples of Classification Algorithms
Decision Trees
Neural Networks
Bayesian Networks
Decision Trees
A decision tree is a predictive model that, as its name
implies, can be viewed as a tree. Specifically, each branch
of the tree is a classification question, and the leaves are
partitions of the data set with their classification.
Figure: a decision tree splitting on Outlook (Sunny, Overcast, Rain), with Yes/No leaves.
Cons
– several tuning parameters to set with little guidance
– decision boundary is non-continuous
– basic algorithms cannot handle continuous data directly (it must be discretized)
– incapable of handling many problems which cannot be divided into attribute domains
– can lead to over-fitting as the trees are constructed from training data
Neural Networks
Neural Network is a set of connected INPUT/OUTPUT
UNITS, where each connection has a WEIGHT
associated with it. It is a case of SUPERVISED,
INDUCTIVE or CLASSIFICATION learning.
Figure: a single neuron. The input vector x = (x0, x1, ..., xn) and weight vector w = (w0, w1, ..., wn) are combined into a weighted sum, a bias μk is subtracted, and an activation function f produces the output y. For example:
y = sign( Σ_{i=0}^{n} w_i x_i − μ_k )
Figure: a single-layer network with p output units connected by weights w_{ik} to inputs x_1, ..., x_n:
y_i(t+1) = f( Σ_{k=0}^{n} w_{ik} x_k(t) ),   i = 1, 2, ..., p
Multi-Layer Perceptron
Figure: the input vector x_i feeds the input nodes; hidden nodes feed the output nodes, which produce the output vector. For a unit j with inputs O_i, weights w_{ij}, bias θ_j, and learning rate l:
Net input: I_j = Σ_i w_{ij} O_i + θ_j
Output: O_j = 1 / (1 + e^{−I_j})
Error at an output node: Err_j = O_j (1 − O_j)(T_j − O_j)
Error at a hidden node: Err_j = O_j (1 − O_j) Σ_k Err_k w_{jk}
Weight update: w_{ij} = w_{ij} + (l) Err_j O_i
Bias update: θ_j = θ_j + (l) Err_j
Advantages of Neural Network
prediction accuracy is generally high
robust, works when training examples contain errors
output may be discrete, real-valued, or a vector of
several discrete or real-valued attributes
fast evaluation of the learned target function
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the
extraction of rules from trained neural networks
Disadvantages of Neural Network
long training time
difficult to understand the learned function
(weights)
not easy to incorporate domain knowledge
Require a number of parameters typically best
determined empirically, e.g., the network
topology or "structure"
Poor interpretability: difficult to interpret the
symbolic meaning behind the learned weights
and of "hidden units" in the network
Association Rule
Proposed by Agrawal et al. in 1993.
It is an important data mining model studied extensively by the database and
data mining community.
Assume all data are categorical.
No good algorithm for numeric data.
Initially used for Market Basket Analysis to find how items purchased by
customers are related.
Given a set of records each of which contain some number of items from a
given collection;
– Produce dependency rules which will predict occurrence of an item based on occurrences of
other items.
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Applications:
Basket data analysis, cross-marketing, catalog design, loss-
leader analysis, clustering, classification, etc.
E.g., 98% of people who purchase tires and auto accessories
also get automotive services done
Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: items purchased in a basket; it may have TID (transaction ID)
A transactional dataset: A set of transactions
Confidence:
The rule holds in T with confidence conf if conf% of
transactions that contain X also contain Y.
conf = Pr(Y | X)
Figure: transactions where the customer buys beer, buys diapers, or buys both. Let minimum support be 50% and minimum confidence be 50%; then we have:
A --> C (support 50%, confidence 66.7%)
C --> A (support 50%, confidence 100%)
Example: Count, Support, Confidence
Data set D (|D| = 4):
TID     Itemsets
T100    1, 3, 4
Count({1, 3}) = 2
Competing objectives: intra-cluster distances are minimized, while inter-cluster distances are maximized.
Types of Clusterings
Partitioning Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in
exactly one subset
– Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum
of square errors
– Typical methods: k-means, k-medoids, CLARA (Clustering LARge Applications)
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
– Create a hierarchical decomposition of the set of data (or objects) using some criterion
– Typical methods: DiAna (Divisive Analysis), AgNes (Agglomerative Nesting), BIRCH (Balanced
Iterative Reducing and Clustering using Hierarchies), ROCK (RObust Clustering using linKs),
CAMELEON
Density-based Clustering
– Based on connectivity and density functions
– Typical methods: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS
(Ordering Points To Identify the Clustering Structure), DenClue (DENsity-based CLUstEring)
Grid-based Clustering
- based on a multiple-level granularity structure
- Typical methods: STING (STatistical INformation Grid ), WaveCluster, CLIQUE
(Clustering In QUEst)
Model-based Clustering
- A model is hypothesized for each of the clusters, and the idea is to find the best
fit of the data to the given model
- Typical methods: EM (Expectation Maximization), SOM (Self-Organizing Map),
COBWEB
Figure: two different hierarchical clusterings of the points p1, p2, p3, p4, shown as Dendrogram 1 and Dendrogram 2.
Strengths of Hierarchical Clustering
Do not have to assume any particular number of
clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level
Input:
k: the number of clusters,
D: a data set containing n objects.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
(5) until no change;
Figure: Clustering of a set of objects based on the k-means method. (The mean
of each cluster is marked by a “+”.)
Example
Figure: K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the cluster with the most similar center; update the cluster means; reassign; repeat until no change.
K-means Clustering – Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
‘Closeness’ is measured mostly by Euclidean distance, cosine
similarity, correlation, etc.
K-means will converge for common similarity measures
mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until
relatively few points change clusters’
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Issues and Limitations for K-means
How to choose initial centers?
How to choose K?
How to handle Outliers?
Clusters different in
Shape
Density
Size
Assumes clusters are spherical in vector space
Sensitive to coordinate changes
K-means Algorithm
Pros
Simple
Fast for low dimensional data
It can find pure sub-clusters if a large number of clusters is specified
Cons
K-Means cannot handle non-globular data of different sizes and densities
K-Means will not identify outliers
K-Means is restricted to data which has the notion of a center (centroid)
Applicable only when mean is defined, then what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Outliers
What are outliers?
The set of objects are considerably dissimilar from the remainder
of the data
Example: Sports: Michael Jordan, Randy Orton, Sachin Tendulkar ...
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
Outlier detection and analysis are very useful for fraud detection,
etc. and can be performed by statistical, distance-based or
deviation-based approaches
How to handle Outliers?
The k-means algorithm is sensitive to outliers !
– Since an object with an extremely large value may substantially distort the
distribution of the data.
Figure: the k-medoids (PAM) method. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; compute the total cost of swapping a medoid O with a random non-medoid O_random; swap if the quality is improved; repeat until no change.
Example of K-medoids
Given that the two medoids initially chosen are A and B.
Based on the following table, and randomly placing items
when distances to the two medoids are identical, we
obtain the clusters {A, C, D} and {B, E}. The three
non-medoids {C, D, E} are examined to see which should
be used to replace A or B. We have six costs to determine:
TC_AC (the cost change of replacing medoid A with medoid
C), TC_AD, TC_AE, TC_BC, TC_BD, and TC_BE.
TC_AC = C_A,AC + C_B,AC + C_C,AC + C_D,AC + C_E,AC = 1 + 0 − 2 − 1 + 0 = −2
where C_A,AC is the cost change of object A after replacing
medoid A with medoid C.
Comparison between K-means and K-
medoids
The k-medoids method is more robust than k-
means in the presence of noise and outliers
because a medoid is less influenced by outliers
or other extreme values than a mean. However,
its processing is more costly than the k-means
method. Both methods require the user to
specify k, the number of clusters.
Thank you !!!
Unit 7 : Mining Complex Types
of Data
Introduction
Internet
• Text mining is the procedure of synthesizing
information by analyzing relations, patterns, and
rules among textual data. These procedures
include text summarization, text categorization,
and text clustering.
Figure: growth in the number of Internet hosts, September 1969 to September 1999 (rising into the tens of millions).
Growing and changing very rapidly
Broad diversity of user communities
Only a small portion of the information on the Web is truly relevant or useful
– 99% of the Web information is useless to 99% of Web users
– How can we find high-quality Web pages on a specified topic?
Web Search Engines
Index-based: search the Web, index Web pages,
and build and store huge keyword-based indices
Help locate sets of Web pages containing
certain keywords
Deficiencies
– A topic of any breadth may easily contain
hundreds of thousands of documents
– Many documents that are highly relevant to a
topic may not contain the keywords defining
them (the synonymy problem)
Web Mining: A More Challenging Task
Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
– Limited query interface based on keyword-oriented
search
– Limited customization to individual users
Web Mining Taxonomy
Web mining is divided into:
• Web Content Mining: Web Page Content Mining and Search Result Mining
• Web Structure Mining
• Web Usage Mining: General Access Pattern Tracking and Customized Usage Tracking

Web Page Content Mining:
• Web Page Summarization: WebLog (Lakshmanan et al. 1996), WebOQL (Mendelzon et al. 1998), etc.: Web structuring query languages; can identify information within given web pages
• Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
• ShopBot (Etzioni et al. 1997): looks for product prices within web pages
Mining the World-Wide Web
Search Result Mining:
• Clustering Search Results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets

Data Mining Tools
DBMiner
– DBMiner Technology Inc developed DBMiner.
– It provides multiple data mining algorithms including discovery-
driven OLAP analysis, association, classification, and clustering
SPSS Clementine
– Integral Solutions Ltd. (ISL) developed Clementine
– Clementine has been acquired by SPSS Inc.
– An integrated data mining development environment for end-
users and developers
– Multiple data mining algorithms and visualization tools including
rule induction, neural nets, classification, and visualization tools
Figure: the SPSS Clementine environment.
Theoretical Foundations of Data Mining
Data reduction
– The basis of data mining is to reduce the data
representation
– Trades accuracy for speed in response
Data compression
– The basis of data mining is to compress the given data
by encoding in terms of bits, association rules,
decision trees, clusters, etc.
Pattern discovery
– The basis of data mining is to discover patterns
occurring in the database, such as associations,
classification models, sequential patterns, etc.
Probability theory
– The basis of data mining is to discover joint probability
distributions of random variables
Microeconomic view
– A view of utility: the task of data mining is finding
patterns that are interesting only to the extent that
they can be used in the decision-making process of
some enterprise
Inductive databases
– Data mining is the problem of performing inductive
logic on databases
– The task is to query the data and the theory (i.e.,
patterns) of the database
– Popular among many researchers in database systems
Statistical Data Mining
There are many well-established statistical techniques
for data analysis, particularly for numeric data
– applied extensively to data from scientific
experiments and data from economics and the
social sciences
Regression
predict the value of a response
(dependent) variable from one or more
predictor (independent) variables where
the variables are numeric
forms of regression: linear, multiple,
weighted, polynomial, nonparametric,
and robust
Generalized linear models
– allow a categorical response variable
(or some transformation of it) to be
related to a set of predictor variables
– similar to the modeling of a numeric
response variable using linear
regression
– include logistic regression and Poisson
regression
Mixed-effect models
For analyzing grouped data, i.e. data that can be classified
according to one or more grouping variables
Typically describe relationships between a response variable and
some covariates in data grouped according to one or more factors
Regression trees
– Binary trees used for classification
and prediction
– Similar to decision trees: tests are
performed at the internal nodes
– In a regression tree the mean of the
objective attribute is computed and
used as the predicted value
Analysis of variance
– Analyze experimental data for two
or more populations described by a
numeric response variable and one
or more categorical variables
(factors)
Factor analysis
– determine which variables are
combined to generate a given
factor
– e.g., for many psychiatric data, one
can indirectly measure other
quantities (such as test scores)
that reflect the factor of interest
Discriminant analysis
– predict a categorical response
variable, commonly used in social
science
– Attempts to determine several
discriminant functions (linear
combinations of the independent
variables) that discriminate among
the groups defined by the
response variable
Time series:
Many methods such as autoregression, ARIMA (autoregressive integrated
moving-average) modeling, and long-memory time-series modeling
Quality control:
Displays group summary charts
Survival analysis
Predicts the probability
that a patient undergoing a
medical treatment would
survive at least to time t
(life span prediction)
Visual and Audio Data Mining
Visualization: use of computer graphics to create visual images
which aid in the understanding of complex, often massive
representations of data
Figure: visual data mining sits at the intersection of high performance computing and human-computer interfaces.
Purpose of Visualization
– Gain insight into an information space by
mapping data onto graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure,
irregularities, relationships among data.
– Help find interesting regions and suitable
parameters for further quantitative analysis.
– Provide a visual proof of computer
representations derived
Integration of visualization and data mining
– data visualization
– data mining result visualization
– data mining process visualization
– interactive visual data mining
Data visualization
– Data in a database or data warehouse can be
viewed
• at different levels of granularity or abstraction
• as different combinations of attributes or
dimensions
– Data can be presented in various visual forms
Boxplots from Statsoft: Multiple Variable
Combinations
Data Mining Result Visualization
Understand
variations with
visualized data
Interactive Visual Data Mining
Using visualization tools in the data mining
process to help users make smart data mining
decisions
Example
– Display the data distribution in a set of attributes using
colored sectors or columns (depending on whether the
whole space is represented by either a circle or a set of
columns)
– Use the display to decide which sector should first be
selected for classification and where a good split point
for this sector may be
Interactive Visual Mining by Perception-Based
Classification (PBC)
Audio Data Mining
Uses audio signals to indicate the patterns of data or the
features of data mining results
An interesting alternative to visual mining
It is the inverse of the task of mining audio (such as
music) databases, which is to find patterns from audio data
Visual data mining may disclose interesting patterns using
graphical displays, but requires users to concentrate on
watching patterns
Instead, transform patterns into sound and music and listen
to pitches, rhythms, tune, and melody in order to identify
anything interesting or unusual
Data Mining and Collaborative Filtering
Social Impact of Data Mining
Is Data Mining a Hype or Will It Be Persistent?
Data mining is a technology
Technological life cycle
– Innovators
– Early adopters
– Chasm
– Early majority
– Late majority
– Laggards
Life Cycle of Technology Adoption