*Slides edited from Han and Kamber’s online lecture
Data Mining (DM)
Lecture 2: Data Mining and its Applications
Ms Ansif Arooj, University of Education, S & T, Township Campus Lahore
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems,
Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—
Automated analysis of massive data sets
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
Knowledge Discovery (KDD) Process
This is a view from the typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Cubic view of Data
Aggregation Hierarchies
© Prentice Hall
Data Warehousing
“Subject-oriented, integrated, time-variant, nonvolatile”
William Inmon
Operational Data: Data used for the day-to-day needs of the company.
Informational Data: Supports other functions such as
planning and forecasting.
Data mining tools often access data warehouses rather than
operational data.
DM: May access data in warehouse.
Operational vs. Informational
               Operational Data      Data Warehouse
Application    OLTP                  OLAP
Use            Precise Queries       Ad Hoc
Temporal       Snapshot              Historical
Modification   Dynamic               Static
Orientation    Application           Business
Data           Operational Values    Integrated
Size           Gigabits              Terabits
Level          Detailed              Summarized
Access         Often                 Less Often
Response       Few Seconds           Minutes
Data Schema    Relational            Star/Snowflake
OLAP
Online Analytical Processing (OLAP): supports more complex
queries than OLTP.
Online Transaction Processing (OLTP): traditional
database/transaction processing.
Dimensional data; cube view
Visualization of operations:
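Slice: fix one dimension at a single value to examine a sub-cube.
Dice: restrict two or more dimensions to examine a smaller sub-cube; rotating the cube to look at another dimension is usually called pivot.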
Roll Up/Drill Down
DM: May use OLAP queries.
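The cube operations on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures, the dimension names in DIMS, and the helper names slice_cube, dice_cube, and roll_up are all hypothetical, and "dice" is taken here in its common sense of restricting several dimensions at once.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter) -> units sold.
DIMS = ("region", "product", "quarter")
SALES = {
    ("East", "Laptop", "Q1"): 10,
    ("East", "Laptop", "Q2"): 12,
    ("East", "Phone", "Q1"): 7,
    ("West", "Laptop", "Q1"): 5,
    ("West", "Phone", "Q1"): 9,
    ("West", "Phone", "Q2"): 11,
}


def slice_cube(cube, dim, value):
    """Slice: keep only the cells where one dimension equals a fixed value."""
    i = DIMS.index(dim)
    return {key: v for key, v in cube.items() if key[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates lie in the allowed values
    of every named dimension."""
    wanted = {DIMS.index(d): set(vals) for d, vals in allowed.items()}
    return {key: v for key, v in cube.items()
            if all(key[i] in vals for i, vals in wanted.items())}


def roll_up(cube, keep):
    """Roll up: sum away every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[tuple(key[i] for i in idx)] += v
    return dict(totals)


by_region = roll_up(SALES, ["region"])          # coarser view
second_quarter = slice_cube(SALES, "quarter", "Q2")
east_q1 = dice_cube(SALES, region=["East"], quarter=["Q1"])
```

Drill down is the reverse of roll up: calling roll_up with more dimensions in `keep` (for example ["region", "product"]) returns a finer-grained view of the same data.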
OLAP Operations
[Figure: cube diagrams illustrating Single Cell, Multiple Cells, Slice, Dice, Roll Up, and Drill Down]
Data Mining in Business Intelligence
Increasing potential to support business decisions (layers listed from top to bottom, typical role in parentheses)
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
KDD Process: A Typical View from ML and Statistics
Input Data → Data Pre-Processing → Data Mining → Post-Processing
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
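Normalization is one of the pre-processing steps named in this view. A minimal sketch of min-max normalization follows. The function name min_max_normalize and the sample values are assumptions made for illustration, and other schemes (such as z-score normalization) are equally common.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: there is no spread to rescale.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


# Hypothetical attribute values (e.g. incomes in thousands).
incomes = [10, 20, 30]
scaled = min_max_normalize(incomes)
```

Putting attributes on a common scale keeps one attribute with large raw values from dominating distance-based mining steps such as clustering.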
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social and
information networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Descriptive (mining tasks characterize properties of the data in a
target data set.) vs. predictive data mining (mining tasks perform
induction on the current data in order to make predictions).
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
FUNCTION OF DATA MINING
Application of Data Mining
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and
multidimensional data model
Data cube technology
Scalable methods for computing (i.e., materializing)
multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description:
Characterization and discrimination
Generalize, summarize,
and contrast data characteristics
Data Mining Function: (2) Association and
Correlation Analysis
Frequent patterns (or frequent item sets)
What items are frequently purchased together in your shopping
mall?
Association, correlation vs. causality
A typical association rule
Butter, Bread → Milk [20%, 100%] (support, confidence)
RentType(X, "game") AND Age(X, "13-19") → Buys(X, "pop") [s = 2%, c = 55%]
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other
applications?
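Support(X → Y) = P(X ∪ Y): the proportion of transactions that contain every item in both X and Y
Confidence(X → Y) = P(Y | X) = Support(X ∪ Y) / Support(X)
Small Data Mining Example
Example: Butter, Bread → Milk, with X = {Butter, Bread} and Y = {Milk}
[20%, 100%] (support, confidence)
Support
the proportion of transactions in the data set which
contain the itemset.
1/5 = 20%
Confidence
the proportion of transactions containing X that also contain Y.
1/1 = 100%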
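The support and confidence figures for the Butter, Bread and Milk rule can be checked with a short sketch. The five transactions below are hypothetical, chosen only so that the counts match the 1/5 and 1/1 on the slide. The helper names are assumptions as well. Lift is added as one common way to ask whether associated items are also correlated.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    covered = count(transactions, antecedent)
    if covered == 0:
        raise ValueError("antecedent never occurs in the data")
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / covered


def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's own frequency;
    values above 1 suggest a positive correlation."""
    base = count(transactions, consequent) / len(transactions)
    return confidence(transactions, antecedent, consequent) / base


# Hypothetical market baskets (not taken from the lecture).
TRANSACTIONS = [
    {"Butter", "Bread", "Milk"},
    {"Bread", "Eggs"},
    {"Milk", "Eggs"},
    {"Butter", "Jam"},
    {"Milk", "Jam"},
]

X, Y = {"Butter", "Bread"}, {"Milk"}
print(support(TRANSACTIONS, X, Y))     # 0.2
print(confidence(TRANSACTIONS, X, Y))  # 1.0
```

With only one supporting transaction, a confidence of 100% says little on its own, which is why rules are usually filtered by a minimum support as well as a minimum confidence.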
CLASS ACTIVITY
Rule 1) Coke, Burger → Diapers
Rule 2) Coke, Burger, Potatoes → Bread
Rule 3) Coke, Burger, Potatoes → Onion, Bread
Rule 4) Burger, Potatoes, Onion → Coke
Transaction ID  Coke  Burger  Potatoes  Onion  Diapers  Bread
1 1 1 1 1
2 1 1 1
3 1 1 1 1
4 1 1 1 1 1
5 1
6 1 1 1
7 1 1 1 1
8 1 1 1 1 1
9 1 1
10 1 1 1 1 1 1
Take home activity
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Also known as supervised classification
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Typical methods
Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
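As a rough illustration of constructing a model from training examples, the sketch below learns a one-level decision tree (a "decision stump") that classifies cars by gas mileage. The training data, the class labels, and the function names train_stump and predict are hypothetical. Practical decision-tree learners use purity measures such as information gain rather than this brute-force error count.

```python
def train_stump(examples):
    """Learn a single threshold test from (value, label) pairs.
    Returns (threshold, label_if_at_or_below, label_if_above)."""
    values = sorted({x for x, _ in examples})
    labels = sorted({y for _, y in examples})
    best = None
    # Candidate thresholds are midpoints between neighbouring values.
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2
        for low_label in labels:
            for high_label in labels:
                errors = sum(
                    1 for x, y in examples
                    if (low_label if x <= t else high_label) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, t, low_label, high_label)
    if best is None:
        raise ValueError("need at least two distinct attribute values")
    _, t, low_label, high_label = best
    return t, low_label, high_label


def predict(stump, x):
    """Apply the learned threshold test to a new value."""
    t, low_label, high_label = stump
    return low_label if x <= t else high_label


# Hypothetical training set: (miles per gallon, mileage class).
CARS = [(15, "low"), (18, "low"), (22, "low"),
        (30, "high"), (35, "high"), (41, "high")]

stump = train_stump(CARS)
print(predict(stump, 28))  # high
```

The class labels in CARS are what make this supervised: the model is fitted to examples whose answers are already known, then applied to unseen cars.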
Classification
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity & minimizing
interclass similarity
Many methods and applications
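One of the many clustering methods is k-means, sketched below on house locations. The coordinates, the starting centroids, and the function names are assumptions for illustration. Real implementations usually pick the starting centroids at random and run several restarts.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def centroid_of(points):
    """Mean position of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: dist2(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [centroid_of(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical house locations on a map grid.
HOUSES = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, groups = kmeans(HOUSES, centroids=[(1, 1), (8, 8)])
```

No class labels are supplied anywhere: the two groups emerge only from how close the houses are to each other, which is the sense in which clustering is unsupervised.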
Clustering
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: A data object that does not comply with the
general behavior of the data
Noise or exception? ― One person’s garbage could be
another person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
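A very simple way to flag objects that do not comply with the general behavior of the data is a z-score rule, sketched below. This is a basic statistical test swapped in for illustration, not the clustering- or regression-based methods named on the slide. The readings, the threshold of 2.0, and the function name are assumptions.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard
    deviations away from the mean of the data."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every value is identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one suspicious entry.
AMOUNTS = [50, 52, 49, 51, 50, 48, 500]
print(z_score_outliers(AMOUNTS))  # [500]
```

The extreme value inflates the mean and standard deviation it is measured against, so on small samples a strict threshold such as 3.0 can miss it. Whether a flagged value is noise or a genuine exception still has to be judged in context.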
Data Mining Function: (6) Prediction
The major idea is to use a large number of past
values to estimate probable future values.
Forecasting and predicting unavailable data
values or a class label for some data.
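Using past values to estimate a future value can be sketched with a least-squares trend line. The sales history and the function names fit_line and forecast are hypothetical, and a straight-line trend is only one of many possible prediction models.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two past values")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(ys, steps=1):
    """Extend the fitted trend `steps` periods past the last observation."""
    slope, intercept = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + steps)


# Hypothetical monthly sales history.
PAST_SALES = [10, 12, 14, 16]
print(forecast(PAST_SALES))  # 18.0
```

Here the predicted quantity is a continuous value. When the unknown is a class label instead, the task becomes classification.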
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine a tremendous number of “patterns”
Some may fit only certain dimension space (time,
location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only
interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]
Summary
Data mining: Discovering interesting patterns and knowledge from massive
amounts of data
A natural evolution of science and information technology, in great demand,
with wide applications
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association,
classification, clustering, trend and outlier analysis, etc.
Data mining technologies and applications
Major issues in data mining
Class Activity
Discuss whether or not each of the following activities is a
data mining task.
(a) Dividing the customers of a company according to
their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to
their profitability.
No. This is an accounting calculation, followed by the
application of a threshold. However, predicting the
profitability of a new customer would be data mining.
Data Mining yes/no?
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification
numbers.
No. Again, this is a simple database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice
were not fair, and we needed to estimate the probabilities of each
outcome from the data, then this would be more like the problems
considered by data mining. However, in this specific case, solutions
to this problem were developed by mathematicians a long time ago,
and thus, we wouldn’t consider it to be data mining.
(f) Predicting the future stock price of a company using
historical records.
Yes. We would attempt to create a model that can
predict the continuous value of the stock price. This is
an example of the area of data mining known as
predictive modelling. We could use regression for this
modelling, although researchers in many fields have
developed a wide variety of techniques for predicting
time series.
(g) Monitoring the heart rate of a patient for
abnormalities.
Yes. We would build a model of the normal behavior
of heart rate and raise an alarm when an unusual heart
behavior occurred. This would involve the area of data
mining known as anomaly detection. This could also
be considered as a classification problem if we had
examples of both normal and abnormal heart behavior.
(h) Monitoring seismic waves for earthquake
activities.
Yes. In this case, we would build a model of different
types of seismic wave behavior associated with
earthquake activities and raise an alarm when one of
these different types of seismic activity was observed.
This is an example of the area of data mining known as
classification.
(i) Extracting the frequencies of a sound wave.
No. This is signal processing.
Data Mining and Data Privacy
For each of the following data sets, explain whether or not data
privacy is an important issue.
(a) Census data collected from 1900–1950.
No
(b) IP addresses and visit times of Web users who visit your Website.
Yes
(c) Images from Earth-orbiting satellites.
No
(d) Names and addresses of people from the telephone book.
No
(e) Names and email addresses collected from the Web.
No
Recommended Reference Books
E. Alpaydin. Introduction to Machine Learning, 2nd ed. MIT Press, 2011
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2009
B. Liu. Web Data Mining. Springer, 2006
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
Y. Sun and J. Han. Mining Heterogeneous Information Networks. Morgan & Claypool, 2012
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005