Datamining ch1

Uploaded by

tofikmohammed471

Chapter 1: Introduction

Credits to: Tan et al

1
What is Data Mining?
 Many Definitions
 Non-trivial extraction of implicit, previously unknown and potentially useful information from data
 Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

2
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 purchases at department/
grocery stores
 Bank/Credit Card
transactions
 Computers have become cheaper and more
powerful
 Competitive Pressure is Strong
 Provide better, customized services for a
competitive advantage (e.g. in Customer
Relationship Management)

3
Why Mine Data? Scientific Viewpoint
 Data collected and stored at enormous speeds
(GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 microarrays generating gene
expression data
 scientific simulations generating terabytes of data
 Traditional techniques infeasible
for raw data
 Data mining may help scientists
 in classifying and segmenting data
 in Hypothesis Formation

4
Mining Large Data Sets - Motivation
 There is often information "hidden" in the data that is not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all

[Figure: "The Data Gap". Total new disk (TB) since 1995 compared with the number of analysts since 1995, for the years 1995 to 1999; vertical axis from 0 to 4,000,000.]

5

Origins of Data Mining
 Draws ideas from machine learning (pattern
recognition), statistics, and database systems
 Traditional Techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data

[Figure: Data Mining shown at the intersection of Statistics, Machine Learning/Pattern Recognition, and Database systems.]

6
Data Mining Tasks
 Prediction Methods
 Use some variables to predict unknown or future
values of other variables.

 Description Methods
 Find human-interpretable patterns that describe
the data.

7
Data Mining Tasks...
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]

8
Classification: Definition
 Given a collection of records (training set)
 Each record contains a set of attributes, one of
the attributes is the class.
 Find a model for class attribute as a function of
the values of other attributes.
 Goal: previously unseen records should be
assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it.
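
The split-then-validate procedure described above can be sketched as follows. This is a minimal illustration, not the slides' own method: the records, the 30% test fraction, the fixed seed, and the helper names (`train_test_split`, `fit_majority_class`, `accuracy`) are all assumptions made for the example, and the "model" is deliberately trivial (it always predicts the most common training class) so that the evaluation steps stay in focus.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(training, class_attr):
    """Build a trivial model: always predict the most common training class."""
    majority = Counter(r[class_attr] for r in training).most_common(1)[0][0]
    return lambda record: majority

def accuracy(model, test, class_attr):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for r in test if model(r) == r[class_attr])
    return correct / len(test)

# Hypothetical labelled records (made up for this sketch).
records = [
    {"income": "high", "buys": "no"},
    {"income": "high", "buys": "no"},
    {"income": "medium", "buys": "yes"},
    {"income": "low", "buys": "yes"},
    {"income": "low", "buys": "yes"},
    {"income": "medium", "buys": "yes"},
    {"income": "low", "buys": "no"},
    {"income": "medium", "buys": "yes"},
    {"income": "high", "buys": "yes"},
    {"income": "medium", "buys": "no"},
]

training, test = train_test_split(records)
model = fit_majority_class(training, "buys")   # built from the training set only
test_accuracy = accuracy(model, test, "buys")  # validated on unseen records
print(f"training={len(training)} test={len(test)} accuracy={test_accuracy:.2f}")
```

The point of the design is that the test records never influence the model, so the reported accuracy estimates how it behaves on previously unseen records.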

9
Classification: Example
Training set (the class attribute is buys_computer):

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes

Test set (class not given):

age     income   student   credit_rating
31…40   medium   no        excellent
31…40   high     yes       fair
>40     medium   no        excellent

10
Classification: Example…
Model (decision tree)

age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair: buys_computer = yes
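
To make the model concrete, the decision tree can be written as a small function and applied to the records from the example slide. This is only a sketch: the function name `predict_buys_computer` and the tuple layout are assumptions for illustration, and the age band is spelled `"31..40"` in the code.

```python
def predict_buys_computer(age, student, credit_rating):
    """Follow the tree: test age first, then student or credit rating."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age == ">40"
    return "no" if credit_rating == "excellent" else "yes"

# Training set from the example slide:
# (age, income, student, credit_rating, buys_computer)
training = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
]

# Test set from the example slide (class unknown).
test = [
    ("31..40", "medium", "no", "excellent"),
    ("31..40", "high", "yes", "fair"),
    (">40", "medium", "no", "excellent"),
]

# Income is not used by this tree, so it is ignored when predicting.
correct = sum(
    1 for age, _income, student, credit, label in training
    if predict_buys_computer(age, student, credit) == label
)
training_accuracy = correct / len(training)

predictions = [
    predict_buys_computer(age, student, credit)
    for age, _income, student, credit in test
]
print(training_accuracy, predictions)
```

The tree reproduces every training label, and it assigns a class to each test record even though none of them appears in the training set.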
Classification Application: Direct Marketing
 Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a particular product.
 Approach:
 Use the data for a similar product introduced before.
 We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision
forms the class attribute.
 Collect various demographic, lifestyle, and company-
interaction related information about all such
customers.
(e.g., type of business, where they stay, how much
they earn, etc.)
 Use this information as input attributes to learn a
classifier model.

12
Classification Application: Fraud Detection
 Goal: Predict fraudulent cases in certain (e.g.,
credit card) transactions.
 Approach:
 Use credit card transactions and the
information on its account-holder as
attributes.
(e.g., when does a customer buy, what does
he buy, how often he pays on time, etc.)
 Label past transactions as fraud or fair
transactions. This forms the class attribute.
 Learn a model for the class of the
transactions.
 Use this model to detect fraud by observing
credit card transactions on an account.
13
Classification Application: Customer Attrition
 Goal: To predict whether a customer is likely to
be lost to a competitor.
 Approach:
 Use detailed record of transactions with each
of the past and present customers, to find
attributes.
(e.g., how often the customer calls, where he
calls, what time-of-the day he calls most, his
financial status, marital status, etc.)
 Label the customers as loyal or disloyal.
 Find a model for loyalty.

14
Association Rule Discovery: Definition
 Given a set of records, each of which contains some
number of items from a given collection;
 Produce dependency rules which will predict
occurrence of an item based on occurrences of
other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
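
One way to check how well the two discovered rules hold on the five transactions is to compute their support and confidence. These two measures are not defined on this slide; they are the standard ones for association rules and are introduced here as an assumption, as are the helper names `count`, `support`, and `confidence`.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [
    ({"Milk"}, {"Coke"}),
    ({"Diaper", "Milk"}, {"Beer"}),
]
for antecedent, consequent in rules:
    s = support(antecedent, consequent)
    c = confidence(antecedent, consequent)
    print(sorted(antecedent), "-->", sorted(consequent),
          f"support={s:.2f} confidence={c:.2f}")
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.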
15
Association Rule Discovery: Application 1
 Marketing and Sales Promotion:
 Let the rule discovered be
{Bread, … } --> {Potato Chips}
 Potato Chips as consequent => Can be used to
determine what should be done to boost its
sales.
 Bread in the antecedent => Can be used to see
which products would be affected if the store
discontinues selling Bread.
 Bread in antecedent and Potato chips in
consequent => Can be used to see what
products should be sold with Bread to promote
sale of Potato chips!

16
Association Rule Discovery: Application 2
 Supermarket shelf management.
 Goal: To identify items that are bought together
by sufficiently many customers.
 Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
 A classic rule --
 If a customer buys diaper and milk, then he
is very likely to buy beer.

17
Clustering: Definition
 Given a set of data points, each having a set
of attributes, and a similarity measure among
them, find clusters such that
 Data points in one cluster are more similar to one
another.
 Data points in separate clusters are less similar
to one another.
 Similarity Measures:
 Euclidean Distance if attributes are continuous.
 Other Problem-specific Measures.
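
As a small illustration of the definition, the sketch below uses Euclidean distance to compare how far apart points are within a cluster and across two clusters. The points and the cluster assignment are made up for the example, and the helper names are assumptions; no clustering algorithm is run here, only the similarity check.

```python
import math
from itertools import combinations, product

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.dist(p, q)

# Hypothetical 2-D data points, already grouped into two clusters.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

def mean_intra_distance(points):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(points_a, points_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(points_a, points_b))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter_ab = mean_inter_distance(cluster_a, cluster_b)
print(f"intra A={intra_a:.2f} intra B={intra_b:.2f} inter A-B={inter_ab:.2f}")
```

A good clustering in the sense of the definition gives small within-cluster distances (high similarity) and large between-cluster distances (low similarity).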

18
Clustering Application: Market Segmentation
 Goal: subdivide a market into distinct subsets
of customers where any subset may
conceivably be selected as a market target to
be reached with a distinct marketing mix.
 Approach:
 Collect different attributes of customers
based on their geographical and lifestyle
related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing
buying patterns of customers in same cluster
vs. those from different clusters.

19
Clustering Application: Document Clustering
 Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
 Approach:
 To identify frequently occurring terms in
each document.
 Form a similarity measure based on the
frequencies of different terms. Use it to
cluster.
 Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
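
The approach above can be sketched with term-frequency vectors and a similarity measure computed from them. The slide does not say which measure to use; cosine similarity is chosen here as one common option, and the three short documents and the helper names are invented for the example.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[t] * freq_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data sets",
    "d3": "the stock market closed higher today",
}
vectors = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
print(f"sim(d1, d2)={sim_12:.2f} sim(d1, d3)={sim_13:.2f}")
```

A clustering method would place d1 and d2 together because they share frequent terms, while d3 shares none with d1 and so has similarity zero.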
20
Regression
 Predict a value of a given continuous valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
 Extensively studied in statistics and in the neural network field.
 Examples:
 Predicting sales amounts of new product based
on advertising expenditure.
 Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
 Time series prediction of stock market indices.
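
For the first example, a linear model of dependency can be fitted with ordinary least squares. The sketch below assumes a single predictor and uses made-up advertising and sales figures; the function name `fit_line` is also an assumption.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)

# Predict the continuous-valued target for a new expenditure level.
predicted_sales = slope * 6.0 + intercept
print(slope, intercept, predicted_sales)
```

Unlike classification, the predicted quantity here is a continuous value rather than a class label.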

21
Deviation/Anomaly Detection
 Detect significant deviations from normal behavior
 Applications:
 Credit Card Fraud Detection

 Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day
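
A very simple way to flag deviations from normal behavior is to measure how many standard deviations each value lies from the mean. This z-score rule is only one possible technique and is not taken from the slides; the transaction amounts, the threshold of 2.0, and the function name are assumptions for illustration.

```python
import statistics

def flag_deviations(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical credit card transaction amounts for one account.
amounts = [20, 22, 19, 21, 23, 20, 500]

flagged = flag_deviations(amounts)
print(flagged)
```

With such a small sample a single extreme value inflates the standard deviation, so practical systems tend to use more robust statistics or learned models of normal behavior.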

22
What is (not) Data Mining?
What is not Data Mining?
 Look up phone number in phone directory
 Dividing the customers of a company according to their gender
 Computing the total sales of a company
 Query a Web search engine for particular information

What is Data Mining?
 Predicting the future stock price of a company using historical records
 Group together similar documents returned by search engine according to their context
 Monitoring the heart rate of a patient for abnormalities
23
Challenges of Data Mining
 Scalability
 Dimensionality
 Complex and Heterogeneous Data
 Data Quality
 Data Ownership and Distribution
 Privacy Preservation
 Streaming Data
