Data Miningppt378

The document discusses data mining and knowledge discovery in databases. It defines data mining as the extraction of interesting, non-trivial patterns from large databases. The document outlines potential applications of data mining such as market analysis, risk analysis, and fraud detection. It also describes common data mining techniques like association rule mining, classification, and clustering.

Uploaded by

Komal Kiran

Data Mining

Chapter 26

Chapter 1. Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Major issues in data mining

Motivation: Necessity is the Mother of Invention

Data explosion problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology

1960s:

Data collection, database creation, IMS and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s-2000s:

Data mining and data warehousing, multimedia databases, and Web databases

What Is Data Mining?

Data mining (knowledge discovery in databases):

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names:

Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is not data mining?

(Deductive) query processing
Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications

Database analysis and decision support

Market analysis and management
target marketing, customer relation management, market basket analysis, cross selling, market segmentation

Risk analysis and management
forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and management

Other Applications

Text mining (news group, email, documents)
Stream data mining
Web mining
DNA data analysis

Market Analysis and Management (1)

Where are the data sources for analysis?

Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing

Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time

Conversion of a single to a joint bank account: marriage, etc.

Cross-market analysis

Associations/co-relations between product sales
Prediction based on the association information

Market Analysis and Management (2)

Customer profiling

data mining can tell you what types of customers buy what products (clustering or classification)

Identifying customer requirements

identifying the best products for different customers
use prediction to find what factors will attract new customers

Provides summary information

various multidimensional summary reports
statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management

Finance planning and asset evaluation

cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning:

summarize and compare the resources and spending

Competition:

monitor competitors and market directions
group customers into classes and a class-based pricing procedure
set pricing strategy in a highly competitive market

Fraud Detection and Management (1)

Applications

widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

Approach

use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

Examples

auto insurance: detect a group of people who stage accidents to collect on insurance
money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and ring of doctors and ring of references

Fraud Detection and Management (2)

Detecting inappropriate medical treatment

Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

Detecting telephone fraud

Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Retail

Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications

Sports

IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat

Astronomy

JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Internet Web Surf-Aid

IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Data Mining: A KDD Process

Data mining: the core of knowledge discovery process.

[Figure: the KDD process. Stages shown: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

Steps of a KDD Process

Learning the application domain:

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation:

Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining

summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Data Mining: On What Kind of Data?

Relational databases
Data warehouses
Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases
Spatial and temporal data
Time-series data and stream data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

Data Mining Functionalities

Association Rule Mining

Association rule mining:

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database

Motivation: finding regularities in data

What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?

Association Rule Mining (cont.)

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of "Customer buys beer" and "Customer buys diapers", with the overlap labelled "Customer buys both"]

Itemset X = {x1, ..., xk}

Find all the rules X → Y with min confidence and support
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y

Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)

Mining Association Rules: an Example

Min. support 50%
Min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:

support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.7%
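The support and confidence figures above can be checked with a short Python sketch. The function names `support` and `confidence` are illustrative choices, not part of the slides; the transactions are the four from the table.

```python
from fractions import Fraction

# Transactions from the slide: transaction id -> items bought.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in db.values() if set(itemset) <= items)
    return Fraction(hits, len(db))

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

for lhs, rhs in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} -> {sorted(rhs)}: support {float(s):.1%}, confidence {float(c):.1%}")
```

Exact fractions are used so that 2/3 is only rounded when printed, which is where the 66.7% comes from.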

Apriori: A Candidate Generation-and-test Approach

Any subset of a frequent itemset must be frequent

if {beer, diaper, nuts} is frequent, so is {beer, diaper}
every transaction having {beer, diaper, nuts} also contains {beer, diaper}

Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!

Method: generate length (k+1) candidate itemsets from length k frequent itemsets, and test the candidates against DB

The performance studies show its efficiency and scalability

The Apriori Algorithm: An Example

Database TDB

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (1st scan)

Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1

Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (candidates from L1, counted in the 2nd scan)

Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2

Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3

Itemset
{B, C, E}

L3 (3rd scan)

Itemset     sup
{B, C, E}   2

The Apriori Algorithm

Pseudo-code:

C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
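The level-wise loop in the pseudo-code can be sketched in Python as follows. This is a minimal sketch rather than a tuned implementation: the function name `apriori` and the absolute threshold `min_count` are illustrative assumptions, and candidates are built from unions of frequent itemsets instead of an ordered self-join (after pruning, both give the same candidate set). It is run on the four-transaction TDB example with a minimum support count of 2.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    # L1: count single items and keep those that meet the threshold.
    counts = {}
    for items in transactions:
        for item in items:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have exactly k+1 items ...
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... keeping only those whose every k-subset is frequent (Apriori pruning).
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for items in transactions:
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The loop stops when a level comes back empty, mirroring the `L_k != ∅` condition.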

Important Details of Apriori

How to generate candidates?

Step 1: self-joining L_k
Step 2: pruning

Example of Candidate-generation

L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace

Pruning:
acde is removed because ade is not in L3

C4 = {abcd}

How to Generate Candidates?

Suppose the items in L_{k-1} are listed in an order

Step 1: self-joining L_{k-1}

insert into C_k
select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Step 2: pruning

forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k
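The join and prune steps translate fairly directly into Python. In this sketch, itemsets are represented as sorted tuples of items, and the name `apriori_gen` is an illustrative choice. It is checked against the L3 example from the slides.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build the candidates C_k from L_(k-1); itemsets are sorted tuples of items."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    size = len(prev[0]) if prev else 0  # this is k-1
    candidates = []
    # Step 1: self-join. p and q agree on their first k-2 items,
    # and p's last item sorts before q's last item.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    # Step 2: prune. Drop any candidate that has a (k-1)-subset outside L_(k-1).
    return [
        c for c in candidates
        if all(s in prev_set for s in combinations(c, size))
    ]

l3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
c4 = apriori_gen(l3)
print(["".join(c) for c in c4])
```

The join produces abcd and acde; the prune step then discards acde because ade is missing from L3.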

Classification and Prediction

Finding models (functions) that describe and distinguish classes or concepts for future prediction

E.g., classify countries based on climate, or classify cars based on gas mileage

Presentation: decision-tree, classification rule, neural network

Prediction: Predict some unknown or missing numerical values

Classification Process: Model Construction

[Figure: Training Data is fed to Classification Algorithms, which produce the Classifier (Model)]

Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
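As a quick sanity check, the rule can be written as a plain Python function and compared against the six training rows. The function name `predict_tenured` is an illustrative choice, and the sketch only applies the rule from the slide; it does not learn it from the data.

```python
# Training rows from the slide: (name, rank, years, tenured).
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """IF rank = professor OR years > 6 THEN tenured = yes, otherwise no."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Collect every training row where the rule disagrees with the recorded label.
mismatches = [row for row in training if predict_tenured(row[1], row[2]) != row[3]]
print(f"{len(training) - len(mismatches)} of {len(training)} training rows match the rule")
```

The rule reproduces every training label, which is expected for a model built from this data and says nothing yet about unseen records.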

Classification Process: Use the Model in Prediction

[Figure: the Classifier is applied to Testing Data, then to Unseen Data]

Testing Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data

(Jeff, Professor, 4)

Tenured?
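Using the model means scoring it on the testing data first and then applying it to the unseen record. The sketch below does both with the rule "rank = professor OR years > 6"; the function name `predict_tenured` is an illustrative choice.

```python
# Testing rows from the slide: (name, rank, years, tenured).
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def predict_tenured(rank, years):
    """The classifier rule: tenured if full professor or more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Accuracy on the testing data: share of rows whose prediction equals the label.
correct = sum(predict_tenured(rank, years) == label for _, rank, years, label in testing)
accuracy = correct / len(testing)
print(f"accuracy on testing data: {accuracy:.2f}")

# Apply the model to the unseen record (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print(f"Jeff tenured? {jeff}")
```

The rule misclassifies Merlisa (7 years, not tenured), so the testing data gives an accuracy estimate below the perfect fit seen on the training data.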

Decision Trees
Training set

age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31..40   high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31..40   low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31..40   medium   no        excellent
31..40   high     yes       fair
>40      medium   no        excellent

Output: A Decision Tree for buys_computer

age?
    <=30: student?
        no: no
        yes: yes
    30..40: yes
    >40: credit rating?
        excellent: no
        fair: yes
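A decision tree like this one can be stored as nested data and walked from the root to a leaf. The sketch below is one possible encoding, not the slides' own representation. It assumes that the leaves for student = no and credit rating = excellent are "no", on the grounds that a split whose branches all gave the same answer would serve no purpose. The sample customer is hypothetical.

```python
# Internal nodes are (attribute, {value: subtree}); leaves are class labels.
# Assumption: the leaves for student = "no" and credit_rating = "excellent" are "no".
tree = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "30..40": "yes",
    ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

def classify(node, record):
    """Follow the record's attribute values from the root until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node

# Hypothetical customer, not taken from the training set.
customer = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print("buys_computer =", classify(tree, customer))
```

Only the attributes on the path from root to leaf are consulted, so income plays no part in this tree's decisions.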

Cluster and outlier analysis

Cluster analysis

Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Outlier analysis

Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
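To make the two ideas concrete, here is a minimal sketch on one-dimensional data. It uses a gap-based grouping (a one-dimensional form of single-linkage clustering with a distance cutoff), chosen for brevity rather than taken from the slides. The house prices, the `max_gap` cutoff, and the function name `cluster_1d` are all hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values, starting a new cluster whenever the gap exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)  # close to the previous value: same cluster
        else:
            clusters.append([v])    # far from everything so far: new cluster
    return clusters

# Hypothetical house prices, in thousands.
prices = [100, 110, 120, 300, 310, 320, 900]
clusters = cluster_1d(prices, max_gap=50)

# Values that end up alone do not follow the general behavior of the data.
groups = [c for c in clusters if len(c) > 1]
outliers = [c[0] for c in clusters if len(c) == 1]
print("clusters:", groups)
print("outliers:", outliers)
```

Members of a group lie close together while the groups themselves are far apart, which is the similarity principle stated above. The isolated value is the kind of object a fraud or rare-event analysis would examine rather than discard.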

Clusters and Outliers
