LectureSlide 1

1) The document provides an outline for a course on data mining, covering an intuitive introduction, textbook chapters, and student presentations. 2) Key concepts around data, information, and knowledge are defined - data is unprocessed facts, information is interpreted data, and knowledge combines information with experience and insight. 3) Data mining aims to discover useful patterns and knowledge automatically from large amounts of data through techniques like classification, clustering, and prediction. It helps address the "data explosion problem" of having more data than the ability to analyze it.

Uploaded by

Rajni Kapoor
Copyright
© Attribution Non-Commercial (BY-NC)

7/22/2010

Data Mining

Dr. Muhammad Abulaish
Reader, Dept. of Computer Science
Jamia Millia Islamia, New Delhi - 25
Email: [email protected]

Data Mining

Part One: Intuitive Introduction and DM Overview
Part Two: Textbook chapters
Part Three: Students Presentations
Course Textbook:
J. Han, M. Kamber
DATA MINING: Concepts and Techniques
Morgan Kaufmann, 2003/2006

Course Outline

Click here to see the course outline

Data

Data is the Latin plural of datum
Used to represent unprocessed facts and figures without any added interpretation or analysis.
Generally associated with some entity and often viewed as the lowest level of abstraction from which information and knowledge are derived.
Data may be unstructured, semi-structured, or structured
Example: The price of petrol is Rs. 48 per liter

Information

Information is interpreted (processed) data so that it has meaning for the user.
“The price of petrol has risen from Rs. 43 to Rs. 48 per liter” – is information for a person who tracks petrol prices.
Data becomes information when it is processed for some purpose and adds value for the recipient.
A set of raw sales figures – Data
Sales report (chart plotting, trend analysis) – Information

Knowledge

Knowledge is a fluid mix of information, experience and insight that may benefit the individual or the organization.
“When petrol prices go up by Rs. 5 per liter, it is likely that bus fare will rise by 10%” is knowledge.
The boundaries between data, information, and knowledge are fuzzy
What is data to one person is information to someone else.


Summarized View

Data – as in databases
Information – Processed data
Knowledge – meta-information about the patterns hidden in the data
The patterns must be discovered automatically

Data Mining, Text Mining and Web Mining

Data are stored in Documents (A file)

Unstructured              Semi-structured               Structured
A file stored on your PC  A web page stored on WWW      A database
Text Mining               Web Mining                    Data Mining

Data Mining

Main Objectives
Identification of data as a source of useful information
Use of discovered information for competitive advantages when working in a business environment

Why Data Mining?

Data explosion problem
The Explosive Growth of Data: from terabytes to petabytes
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

Why Data Mining? (c.d.)

Data explosion problem (c.d.)
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras,

Why DM? (c.d.)

Data explosion problem (c.d.)
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and Data Mining
Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


The Huber Taxonomy of Data Set Sizes

Descriptor   Data Set Size in Bytes   Storage Mode
Tiny         10^2                     Piece of Paper
Small        10^4                     A Few Pieces of Paper
Medium       10^6                     A Floppy Disk
Large        10^8                     Hard Disk
Huge         10^10                    Multiple Hard Disks, e.g. RAID Storage
Massive      10^12                    Robotic Magnetic Tape, Storage Silos

Algorithmic Complexity

Algorithm                                                      Complexity
Plot a scatterplot                                             O(n^1/2)
Calculate means, variances, kernel density estimates           O(n)
Calculate fast Fourier transforms                              O(n log(n))
Calculate singular value decomposition of an r x c matrix;
solve a multiple linear regression                             O(nc)
Solve most clustering algorithms                               O(n^2)

No. of Operations for Algorithms of Various Computational Complexities and various Data Set Sizes

n        n^1/2    n        n log(n)   n^3/2    n^2
tiny     10       10^2     2x10^2     10^3     10^4
small    10^2     10^4     4x10^4     10^6     10^8
medium   10^3     10^6     6x10^6     10^9     10^12
large    10^4     10^8     8x10^8     10^12    10^16
huge     10^5     10^10    10^11      10^15    10^20

Computational Feasibility on a Pentium PC
10 MegaFLOPs Performance Assumed

n        n^1/2           n               n log(n)         n^3/2           n^2
tiny     10^-6 seconds   10^-5 seconds   2x10^-5 seconds  .0001 seconds   .001 seconds
small    10^-5 seconds   .001 seconds    .004 seconds     .1 seconds      10 seconds
medium   .0001 seconds   .1 seconds      .6 seconds       1.67 minutes    1.16 days
large    .001 seconds    10 seconds      1.3 minutes      1.16 days       31.7 years
huge     .01 seconds     16.7 minutes    2.78 hours       3.17 years      317,000 years
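
The entries in the two tables above follow from simple arithmetic: count the operations for a given complexity class and data set size, then divide by the machine's assumed FLOP rate. The sketch below reproduces that calculation. The size values (tiny = 10^2 up to huge = 10^10), the base-10 logarithm, and the function and dictionary names are assumptions inferred from the table entries, not something the slides state explicitly.

```python
import math

# Data set sizes n, inferred from the operation-count table
# (assumption: tiny = 10^2, small = 10^4, ..., huge = 10^10).
SIZES = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}

# Operation counts per complexity class. The logarithm is taken base 10,
# which is what makes the n log(n) column match (e.g. 10^2 * 2 = 2x10^2).
COMPLEXITIES = {
    "n^1/2": lambda n: math.sqrt(n),
    "n": lambda n: n,
    "n log(n)": lambda n: n * math.log10(n),
    "n^3/2": lambda n: n ** 1.5,
    "n^2": lambda n: n ** 2,
}


def operations(size, complexity):
    """Number of operations for a named data set size and complexity class."""
    return COMPLEXITIES[complexity](SIZES[size])


def seconds_needed(size, complexity, flops=10e6):
    """Estimated run time in seconds on a machine doing `flops` operations per second."""
    return operations(size, complexity) / flops


if __name__ == "__main__":
    # Medium data set, O(n^2) algorithm, 10 MegaFLOPs Pentium PC.
    days = seconds_needed("medium", "n^2") / 86400
    print(f"{days:.2f} days")
```

Passing a different `flops` value (for example `300e6` or `4.2e9`) gives the corresponding estimates for faster machines.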

Computational Feasibility on a Silicon Graphics Onyx Workstation
300 MegaFLOPs Performance Assumed

n        n^1/2              n                  n log(n)           n^3/2              n^2
tiny     3.3x10^-8 seconds  3.3x10^-7 seconds  6.7x10^-7 seconds  3.3x10^-6 seconds  3.3x10^-5 seconds
small    3.3x10^-7 seconds  3.3x10^-5 seconds  1.3x10^-4 seconds  3.3x10^-3 seconds  .33 seconds
medium   3.3x10^-6 seconds  3.3x10^-3 seconds  .02 seconds        3.3 seconds        55 minutes
large    3.3x10^-5 seconds  .33 seconds        2.7 seconds        55 minutes         1.04 years
huge     3.3x10^-4 seconds  33 seconds         5.5 minutes        38.2 days          10,464 years

Computational Feasibility on an Intel Paragon XP/S A4
4.2 GigaFLOPs Performance Assumed

n        n^1/2              n                  n log(n)           n^3/2              n^2
tiny     2.4x10^-9 seconds  2.4x10^-8 seconds  4.8x10^-8 seconds  2.4x10^-7 seconds  2.4x10^-6 seconds
small    2.4x10^-8 seconds  2.4x10^-6 seconds  9.5x10^-6 seconds  2.4x10^-4 seconds  .024 seconds
medium   2.4x10^-7 seconds  2.4x10^-4 seconds  .0014 seconds      .24 seconds        4.0 minutes
large    2.4x10^-6 seconds  .024 seconds       .19 seconds        4.0 minutes        27.8 days
huge     2.4x10^-5 seconds  2.4 seconds        24 seconds         66.7 hours         761 years


Computational Feasibility on a TeraFLOP Grand Challenge Computer
1000 GigaFLOPs Performance Assumed

n        n^1/2            n                n log(n)          n^3/2           n^2
tiny     10^-11 seconds   10^-10 seconds   2x10^-10 seconds  10^-9 seconds   10^-8 seconds
small    10^-10 seconds   10^-8 seconds    4x10^-8 seconds   10^-6 seconds   10^-4 seconds
medium   10^-9 seconds    10^-6 seconds    6x10^-6 seconds   .001 seconds    1 second
large    10^-8 seconds    10^-4 seconds    8x10^-4 seconds   1 second        2.8 hours
huge     10^-7 seconds    .01 seconds      .1 seconds        16.7 minutes    3.2 years

Types of Computers for Interactive Feasibility
Response Time < 1 Second

n        n^1/2               n                   n log(n)            n^3/2               n^2
tiny     Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
small    Personal Computer   Personal Computer   Personal Computer   Personal Computer   Super Computer
medium   Personal Computer   Personal Computer   Personal Computer   Super Computer      Teraflop Computer
large    Personal Computer   Workstation         Super Computer      Teraflop Computer   ---
huge     Personal Computer   Super Computer      Teraflop Computer   ---                 ---

Types of Computers for Feasibility
Response Time < 1 Week

n        n^1/2               n                   n log(n)            n^3/2               n^2
tiny     Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
small    Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
medium   Personal Computer   Personal Computer   Personal Computer   Personal Computer   Personal Computer
large    Personal Computer   Personal Computer   Personal Computer   Personal Computer   Teraflop Computer
huge     Personal Computer   Personal Computer   Personal Computer   Super Computer      ---

Massive Data Sets: Commonly Used Language

Data Mining = DM
Knowledge Discovery in Databases = KDD
Massive Data Sets = MD
Data Analysis = DA

What is Data Mining?

There are many activities with the same name: CONFUSION
DM: Huge volumes of data
DM: Potential hidden knowledge
DM: Process of discovery of hidden patterns in data

DM: Intuitive Definition

Process to extract previously unknown knowledge from large volumes of data
Requires both new technologies and methods


Data Mining

DM creates models (algorithms):
Classification
Clustering
Association
Prediction
DM often presents the knowledge as a set of rules of the form:
IF.... THEN...
Finds other relationships in data
Detects deviations

DM Some Applications

Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
Forecasting, customer retention, quality control, competitive analysis

DM Other Applications

Other Applications
Text mining (news group, email, documents) and Web analysis.
Intelligent query answering
Scientific Applications

DM: Business Advantages

Data Mining uses gathered data to
Predict tendencies and waves
Classify new data
Find previously unknown patterns
Discover unknown relationships

DM: Technologies

Many commercially available tools
Many methods (models, algorithms) for the same task
TOOLS ALONE ARE NOT THE SOLUTION
The user must be able to interpret the results; one of the requirements of DM is:
“the results must be easily comprehensible to the user”
Most often, especially when dealing with statistical methods, analysts are needed to interpret the knowledge – a weakness of statistical methods.

Data Mining vs Statistics

Some statistical methods are considered a part of Data Mining, i.e. they are used as Data Mining algorithms, or as a part of Data Mining algorithms
Some, like statistical prediction methods of different types of regression and clustering methods, are now considered an integral part of Data Mining research and applications


Business Applications

Buying patterns
Fraud detection
Decision support
Medical applications
Marketing
and more

Fraud Detection and Management (B1)

Applications
widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

Fraud Detection and Management (B2)

Examples
auto insurance: detect characteristics of group of people who stage accidents to collect on insurance
money laundering: detect characteristics of suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect characteristics of fraudulent patients and doctors

Fraud Detection and Management (B3)

Detecting inappropriate medical treatment
Australian Health Insurance Commission detected that in many cases blanket screening tests were requested (save Australian $1m/yr).
Detecting telephone fraud
DM builds telephone call model: destination of the call, duration, time of day or week. Detects patterns that deviate from an expected norm.
British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Fraud Detection and Management (B4)

Retail
Analysts used Data Mining techniques to estimate that 38% of retail shrink is due to dishonest employees
and more….

Data Mining vs Data Marketing

Data Mining methods apply to many domains
Applications of Data Mining methods in which the goal is to find buying patterns in Transactional Data Bases have been named: Data Marketing


Market Analysis and Management (MA1)

Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
DM finds clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

Market Analysis and Management (MA2)

Determine customer purchasing patterns over time
Conversion of single to a joint bank account: when marriage occurs, etc.
Cross-market analysis
Associations/co-relations between product sales
Prediction based on the association information

Market Analysis and Management (MA3)

Customer profiling
data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers

Corporate Analysis and Risk Management (CA1)

Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
summarize and compare the resources and spending

Corporate Analysis and Risk Management (CA2)

Competition:
monitor competitors and market directions
group customers into classes and a class-based pricing procedure
set pricing strategy in a highly competitive market

Business Summary

Data Mining helps to improve the competitive advantage of organizations in a dynamically changing environment; it improves client retention and conversion
Different Data Mining methods are required for different kinds of data and different kinds of goals


Scientific Applications

Networks failure detection
Controllers
Geographic Information Systems
Genome- Bioinformatics
Intelligent robots
etc.

Other Applications

Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
And more …..

What is NOT Data Mining

Once the patterns are found, the Data Mining process is finished
The use of the patterns is not Data Mining
Queries to the database are not DM

Evolution of Database Technology

1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation

Evolution of Database Technology c.d.

1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s:
Data mining and data warehousing, multimedia databases, and Web databases

Short History of Data Mining

1989 - KDD term (Knowledge Discovery in Databases) appears in (IJCAI Workshop)
1991 - a collection of research papers edited by Piatetsky-Shapiro and Frawley
1993 – Association Rule Mining Algorithm APRIORI proposed by Agrawal, Imielinski and Swami.
1996 – present: KDD evolves as a conjunction of different knowledge areas (data bases, machine learning, statistics, artificial intelligence) and the term Data Mining becomes popular


Data Mining: Confluence of Multiple Disciplines

[Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, Other Disciplines]

KDD process: Definition [Piatetsky-Shapiro 97]

KDD is a non-trivial process for identification of:
Valid
New
Potentially useful
Understandable
patterns in data

The KDD process

[Diagram: Data -> SELECTION -> Target data -> CLEANING -> Processed Data -> CODIFICATION -> Transformed data -> DATA MINING -> Models -> INTERPRETATION AND EVALUATION -> knowledge]

Steps of the KDD process

Preprocessing: includes all the operations that have to be performed before a data mining algorithm is applied (Chapter 3)
Data Mining: knowledge discovery algorithms are applied in order to obtain the patterns (Chapters 6, 7, and 8)
Interpretation: discovered patterns are presented in a proper format and the user decides if it is necessary to re-iterate the algorithms
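
The stages of the KDD process can be pictured as a chain of small functions, each feeding the next. The following is only a toy sketch under stated assumptions: the records, the field names (`city`, `spend`), the binning thresholds and the frequency-count "mining" step are all hypothetical stand-ins, chosen to show how the stages connect rather than how any real mining algorithm works.

```python
from collections import Counter


def select(records, fields):
    """SELECTION: keep only the attributes relevant to the task (target data)."""
    return [{f: r.get(f) for f in fields} for r in records]


def clean(records):
    """CLEANING: drop records that contain missing values (processed data)."""
    return [r for r in records if None not in r.values()]


def codify(records, field, bins):
    """CODIFICATION: replace a numeric value by a label (transformed data).

    `bins` is an ascending list of (upper_limit, label); the last limit
    should be float("inf") so every value receives a label.
    """
    coded = []
    for r in records:
        label = next(name for limit, name in bins if r[field] <= limit)
        coded.append({**r, field: label})
    return coded


def mine(records, field):
    """DATA MINING: a stand-in 'algorithm' that counts value frequencies (model)."""
    return Counter(r[field] for r in records)


def interpret(model, min_share):
    """INTERPRETATION: keep only patterns covering at least `min_share` of records."""
    total = sum(model.values())
    return {v: c / total for v, c in model.items() if c / total >= min_share}


# Hypothetical raw data; record 2 has a missing value and is removed by cleaning.
raw = [
    {"id": 1, "city": "Delhi", "spend": 120.0},
    {"id": 2, "city": "Delhi", "spend": None},
    {"id": 3, "city": "Agra", "spend": 40.0},
    {"id": 4, "city": "Delhi", "spend": 300.0},
]

target = select(raw, ["city", "spend"])
processed = clean(target)
transformed = codify(processed, "spend", [(100.0, "low"), (float("inf"), "high")])
model = mine(transformed, "spend")
knowledge = interpret(model, min_share=0.5)
print(knowledge)
```

If the user judges the result unsatisfactory at the interpretation stage, any earlier stage can be re-run with different parameters, which is the re-iteration the slide refers to.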

DM: Data Mining

DM is a step of the KDD process in which algorithms are applied to look for patterns in data
It is necessary to apply first the preprocessing operation to clean and preprocess the data in order to obtain significant patterns

KDD vs DM

KDD is a term used by Academia
DM is a commercial term
DM term is also being used in Academia, as it has become a “brand name” for both KDD process and its DM sub-process
The important point is to see Data Mining as a process


Architecture of a Typical Data Mining System

[Diagram, top to bottom:]
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge-base
Database or data warehouse server
Data cleaning & data integration, Filtering
Databases, Data Warehouse

Data Mining: On What Kind of Data?

Relational Databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

DM Functionalities (1)

Concept, class, description
Concept – is defined semantically as any subset of records.
We often define the concept by attribute c and its value v
In this case the concept description is syntactically written as: c=v and we define:
CONCEPT={records: c=v}
For example: climate=wet (description of the concept)
CONCEPT={records: climate=wet}
We use word: CLASS, class attribute for Concept, concept attribute

DM Functionalities (2)

Concept characteristics
Concept C characteristics is a set of attributes a1, a2, … ak, and their respective values v1, v2, … vk that are characteristic for a given concept c, i.e.
{records: a1=v1 & a2=v2 & … & ak=vk}
Characteristics description is then syntactically written as
a1=v1 & a2=v2 & … & ak=vk
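
The definition CONCEPT={records: c=v} reads directly as a filter over a table of records. A minimal sketch, assuming records are plain dictionaries; the function name `concept`, the sample countries and the `terrain` attribute are hypothetical and serve only to illustrate the notation.

```python
def concept(records, **description):
    """Return the subset of records matching every attribute=value pair.

    concept(records, climate="wet") corresponds to {records: climate=wet};
    several pairs correspond to a1=v1 & a2=v2 & ... & ak=vk.
    """
    return [
        r for r in records
        if all(r.get(attr) == value for attr, value in description.items())
    ]


# Hypothetical records, used only to illustrate the notation.
records = [
    {"country": "A", "climate": "wet", "terrain": "flat"},
    {"country": "B", "climate": "dry", "terrain": "flat"},
    {"country": "C", "climate": "wet", "terrain": "hilly"},
]

wet = concept(records, climate="wet")                        # {records: climate=wet}
wet_flat = concept(records, climate="wet", terrain="flat")   # climate=wet & terrain=flat
print([r["country"] for r in wet], [r["country"] for r in wet_flat])
```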

Characterization

Describes the process whose aim is to find rules that describe properties of a concept. They take the form
If concept then characteristics
C=1 → A=1 & B=3 25% (support: there are 25% of the records for which the rule is true)
C=1 → A=1 & B=4 17%
C=1 → A=0 & B=2 16%

Discrimination

It is the process whose aim is to find rules that allow us to discriminate the objects (records) belonging to a given concept (one class) from the rest of the records (classes)
If characteristics then concept
A=0 & B=1 → C=1 33% 83% (support, confidence: the conditional probability of the concept given the characteristics)
A=2 & B=0 → C=1 27% 80%
A=1 & B=1 → C=1 12% 76%
A discriminant rule can be good even if it has a low support (and high confidence)
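
Support and confidence as defined above can be computed by counting records. The sketch below assumes records are dictionaries of attribute values and that a rule is given as two dictionaries (antecedent and consequent). The ten sample records are invented for illustration and do not reproduce the percentages on the slide.

```python
def matches(record, conditions):
    """True if the record satisfies every attribute=value condition."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def support(records, antecedent, consequent):
    """Fraction of all records for which the whole rule is true."""
    both = sum(1 for r in records if matches(r, antecedent) and matches(r, consequent))
    return both / len(records)


def confidence(records, antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    covered = [r for r in records if matches(r, antecedent)]
    if not covered:
        return 0.0
    return sum(1 for r in covered if matches(r, consequent)) / len(covered)


# Ten hypothetical records over attributes A, B and the class attribute C.
records = (
    [{"A": 0, "B": 1, "C": 1}] * 3
    + [{"A": 0, "B": 1, "C": 0}] * 1
    + [{"A": 1, "B": 3, "C": 1}] * 2
    + [{"A": 2, "B": 0, "C": 0}] * 4
)

# Discriminant rule: A=0 & B=1 -> C=1
print(support(records, {"A": 0, "B": 1}, {"C": 1}))     # 3 of 10 records
print(confidence(records, {"A": 0, "B": 1}, {"C": 1}))  # 3 of the 4 covered records

# Characteristic rule: C=1 -> A=1 & B=3
print(support(records, {"C": 1}, {"A": 1, "B": 3}))     # 2 of 10 records
```

In this toy data the discriminant rule covers only 30% of the records yet holds for 75% of the records it applies to, which illustrates why low support does not by itself make a discriminant rule bad.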


Data Mining Functionalities (3)

Classification and Prediction - Supervised learning
Finding models (rules) that describe (characterize) or/and distinguish (discriminate) classes or concepts for future prediction
Example: classify countries based on climate (characteristics), or classify cars based on gas mileage and use it to predict classification of a new car
Presentation: decision-tree, classification rules, neural network, Bayes Network

Data Mining Functionalities (4)

Prediction (statistical)
- predict some unknown or missing numerical values
Cluster analysis
Class label is unknown: Group data to form new classes - unsupervised learning
For example: cluster houses to find distribution patterns
Clustering is based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
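
To make the clustering principle concrete, here is a minimal sketch of one well-known clustering method, k-means, applied to one-dimensional data. The slides do not name a specific algorithm, so k-means is chosen here only as an illustration; the house prices and the starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centers.

    Each pass assigns every value to its nearest center (high intra-class
    similarity) and then moves each center to the mean of its cluster.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical house prices (in arbitrary units) with two visible groups.
prices = [30, 32, 35, 90, 95, 100]
centers, clusters = kmeans_1d(prices, centers=[30, 100])
print(centers, clusters)
```

No class labels are supplied; the two groups emerge from the data alone, which is what distinguishes this from the supervised classification described under Functionalities (3).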

Data Mining Functionalities (5)

Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

Major Issues in Data Mining (1)

Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data mining
Expression and visualization of data mining results

Major Issues in Data Mining (2)

Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining methods

Major Issues in Data Mining (3)

Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases and global information systems (WWW)
Issues related to applications and social impacts
Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
Protection of data security, integrity, and privacy


APPROACHES TO DATA MINING

Approaches (I)

Mathematics: consists in the creation of mathematical models to extract rules, regularities and patterns (rough sets)

Statistics: focuses on the creation of statistical models to analyse data (Bayesian networks)

Approaches (II)

Artificial Intelligence:
Classification trees (ID3, C4.5..)
Clustering

Neural Networks
Genetic algorithms
Visualization techniques
...
