Topic 1c - Tasks & Techniques

The document provides an overview of data mining, including its objectives, tasks, and techniques. It discusses various data mining tasks such as classification, clustering, and association rule discovery, along with their applications in fields like marketing and fraud detection. Additionally, it highlights the importance of understanding data relationships and the methods used to analyze and predict trends from data.


Topic 1c: Tasks and Techniques of Data Mining

Ts. Dr. Tuan Norhafizah Tuan Zakaria
Objectives

• To introduce Data Mining and its relationship with data and knowledge
• To discuss the history, evolution and motivation of Data Mining
• To discuss Data Mining techniques, tasks, applications and some major issues
Knowledge Discovery in Databases

DM: Tasks and Techniques

Tasks
• Classification
• Clustering
• Association Rules
• Prediction
• Sequential Analysis
• Deviation analysis
• Similarity analysis
• Trend analysis

Techniques
• Decision Trees
• Association Rule
• k-means
• Neural Networks
• Naïve Bayes
• k-nearest neighbor
• Statistical Method
Classification: Definition

Given a collection of records (training set)
• Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
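The train/test procedure can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the slides: the function names, the 30% test fraction, the toy records and the majority-class baseline model are all assumptions chosen to keep the example short.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_class(train_set):
    """Baseline model (assumption): always predict the most frequent training class."""
    labels = [label for _attrs, label in train_set]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda attrs: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test_set if model(attrs) == label)
    return correct / len(test_set)

# Hypothetical records in the form (attributes, class).
records = [((i,), "No" if i % 3 else "Yes") for i in range(10)]
train_set, test_set = train_test_split(records)
model = learn_majority_class(train_set)
print(f"accuracy on the held-out test set: {accuracy(model, test_set):.2f}")
```

The model never sees the test records while it is being built, so the reported accuracy estimates how well it handles previously unseen records.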
Classification Example

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model; Test Set --> Model
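As an illustration, the sketch below applies a small decision tree to the records of this example. The tree is written by hand as an assumption; it is not produced by a learning algorithm and is not taken from the slides. The 80K split point is also an assumption: any threshold above 70K and at most 85K fits the training records equally well.

```python
def parse_income(text):
    """Convert a string such as '125K' to the integer 125 (thousands)."""
    return int(text.rstrip("K"))

def predict_cheat(refund, marital_status, taxable_income):
    """Hand-written decision tree (an assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumed split point of 80K for single or divorced customers without a refund.
    return "Yes" if parse_income(taxable_income) >= 80 else "No"

# Training records from the example: (Refund, Marital Status, Taxable Income, Cheat)
training_set = [
    ("Yes", "Single", "125K", "No"), ("No", "Married", "100K", "No"),
    ("No", "Single", "70K", "No"), ("Yes", "Married", "120K", "No"),
    ("No", "Divorced", "95K", "Yes"), ("No", "Married", "60K", "No"),
    ("Yes", "Divorced", "220K", "No"), ("No", "Single", "85K", "Yes"),
    ("No", "Married", "75K", "No"), ("No", "Single", "90K", "Yes"),
]

# Test records from the example: the class is unknown.
test_set = [
    ("No", "Single", "75K"), ("Yes", "Married", "50K"), ("No", "Married", "150K"),
    ("Yes", "Divorced", "90K"), ("No", "Single", "40K"), ("No", "Married", "80K"),
]

training_accuracy = sum(
    predict_cheat(r, m, t) == c for r, m, t, c in training_set
) / len(training_set)
test_predictions = [predict_cheat(r, m, t) for r, m, t in test_set]

print("training accuracy:", training_accuracy)
print("test predictions:", test_predictions)
```

This tree reproduces the class of all ten training records and labels all six test records "No". Since the true classes of the test records are not given, its accuracy on them cannot be checked here.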
Classification: Direct Marketing

Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

Approach:
• Identify, from what we know, which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction related information: type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Customer Attrition/Churn

Goal: To predict whether a customer is likely to be lost to a competitor.

Approach:
• Use detailed records of transactions (past and present customers): how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Clustering

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.

Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
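The two kinds of measure can be sketched as follows. Euclidean distance is the one named above; the mismatch count for categorical attributes is only one possible problem-specific measure, chosen here as an assumption for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with continuous attributes."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mismatch_distance(p, q):
    """Illustrative measure for categorical attributes: number of attributes that differ."""
    if len(p) != len(q):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for a, b in zip(p, q) if a != b)

print(euclidean_distance((0, 0, 0), (3, 4, 0)))
print(mismatch_distance(("Yes", "Single"), ("No", "Single")))
```

A smaller distance means two data points are more similar, so a clustering method can use either function to decide which points belong together.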
Clustering: Euclidean Distance Based Clustering in 3-D space.

• Intracluster distances are minimized
• Intercluster distances are maximized
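One technique that pursues small intracluster distances is k-means, which appears in the list of techniques earlier in these slides. The sketch below is a bare-bones version under stated assumptions: the first k points are used as the initial centroids, the number of iterations is fixed, and the 3-D points are made up for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids (a simplification)."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical 3-D points forming two well-separated groups.
points = [(0, 0, 0), (10, 10, 10), (1, 0, 0), (0, 1, 0), (10, 11, 10), (11, 10, 10)]
centroids, clusters = k_means(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Practical implementations usually pick the initial centroids more carefully and stop once the assignments no longer change; both refinements are left out here for brevity.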
Clustering: Market Segmentation

Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

Approach:
• Collect different attributes of customers based on their geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Market Segmentation

• Segment 1: high duration but low number of generated calls and moderate number of sent and received SMS.
• Segment 2: moderate duration of generated calls and moderate to high data usage.
• Segment 3: high duration of off-net calls, high number of generated calls, and moderate to low of both duration of generated calls and data usage.
• Segment 4: very low call duration, high sent and received SMS, and high data usage.
• Segment 5: very low data usage, low duration of generated calls, and high number of received calls with respect to the number of generated calls.
• Segment 6: relatively high duration of international calls.
Clustering: Document Clustering

Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

Approach:
• To identify frequently occurring terms in each document.
• Form a similarity measure based on the frequencies of different terms. Use it to cluster.

Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
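A similarity measure based on term frequencies could look like the sketch below. Cosine similarity is an assumption here, since the slides do not name a specific measure, and splitting on whitespace is a deliberately crude way to extract terms.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (1.0 = same direction)."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents.
doc_a = term_frequencies("data mining finds patterns in data")
doc_b = term_frequencies("mining patterns from large data sets")
doc_c = term_frequencies("football match results")

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Documents that share many frequent terms score close to 1, while documents with no terms in common score 0, so the scores can feed a clustering method directly.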
Association Rule Discovery

• Given a set of records each of which contain some number of items from a given collection;
• Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
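The strength of such rules is commonly expressed with two standard measures, support and confidence. The slides do not name them, so they are introduced here as an assumption. The sketch computes both for the two discovered rules over the five transactions listed in the table.

```python
# The five transactions from the table, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs, rhs),
          "confidence:", round(confidence(lhs, rhs), 2))
```

For {Milk} --> {Coke}, three of the five transactions contain both items (support 0.6) and three of the four Milk transactions also contain Coke (confidence 0.75). For {Diaper, Milk} --> {Beer} the support is 0.4 and the confidence is 2/3.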
Association Rule Discovery: Marketing & Sales Promotion

• Let the rule discovered be
{Bagels, … } --> {Potato Chips}
• Potato Chips as consequent can be used to determine what should be done to boost its sales.
• Bagels in the antecedent can be used to see which products would be affected if the store discontinues selling bagels.
• Bagels in antecedent and Potato Chips in consequent can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
Association Rule Discovery: Supermarket Shelf Management

Goal: To identify items that are bought together by sufficiently many customers.

Approach:
• Process the point-of-sale data collected with barcode scanners to find dependencies among items.

A classic rule
• If a customer buys diaper and milk, then he is very likely to buy root beer.
• So, don’t be surprised if you find six-packs of root beer stacked next to diapers!
Retail Analytics

https://www.digitalnewsasia.com/download/tapwaycasestudy.pdf
Regression

• Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Greatly studied in statistics and machine learning fields.

Examples:
• Predicting sales amounts of new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
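For the linear case with a single input variable, the model can be fitted by ordinary least squares. The sketch below assumes made-up advertising and sales figures that lie exactly on a straight line, so the fitted coefficients are easy to check by hand; real data would scatter around the line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales (arbitrary units).
advertising = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5 + intercept  # prediction for an unseen expenditure of 5
print(slope, intercept, predicted_sales)
```

With these figures the fit recovers a slope of 2 and an intercept of 1, which gives a predicted sales value of 11 for an expenditure of 5.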
Deviation Analysis

Discovering most significant changes in data from previously measured or normative values
• Usually categorized separately from other data mining tasks
• Deviations are often infrequent
• Modifications of classification, clustering, time series analysis can be used as a means to achieve the goal
• Outlier detection in statistics


Deviation Analysis: Anomaly Detection

Detect significant deviations from normal behavior.

Applications:
• Credit card fraud detection
• Network intrusion detection

Typical network traffic at University level may reach over 100 million connections per day
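A very simple statistical way to flag deviations from normal behavior is the z-score: how many standard deviations a value lies from the mean. The sketch below is an assumption for illustration only. The threshold of 2 and the transaction amounts are made up, and real fraud or intrusion detection systems rely on far richer models.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical daily card transaction amounts with one unusual value.
daily_amounts = [10, 11, 9, 10, 12, 10, 11, 100]
suspicious = find_outliers(daily_amounts)
print(suspicious)
```

Only the amount 100 is flagged in this data. Note that a single extreme value inflates the mean and the standard deviation themselves, which is one reason robust alternatives such as median-based measures are often preferred.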
Deviation Analysis: Fraud Detection

• Identify employee accounts at financial institutions that have excess numbers of credit memos. Excess credit memos can indicate diversion of funds into employee accounts.
• Compare employee home addresses, social security numbers, telephone numbers and bank routing and account numbers to those of vendors from vendor master file. This test can reveal bogus or improperly selected vendor accounts.
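The employee-versus-vendor comparison can be sketched as a simple field-by-field match. The record layout, the field names and the sample data below are hypothetical; a real test would also need to normalize addresses and phone numbers before comparing them.

```python
# Hypothetical field names shared by employee and vendor records.
FIELDS = ("address", "phone", "bank_account")

def shared_fields(employee, vendor, fields=FIELDS):
    """List the fields on which an employee record and a vendor record agree."""
    return [f for f in fields if employee.get(f) and employee.get(f) == vendor.get(f)]

def flag_suspicious_vendors(employees, vendors):
    """Return (vendor name, employee name, matching fields) for every overlap found."""
    flags = []
    for vendor in vendors:
        for employee in employees:
            common = shared_fields(employee, vendor)
            if common:
                flags.append((vendor["name"], employee["name"], common))
    return flags

# Made-up records for illustration.
employees = [
    {"name": "Employee A", "address": "12 Jalan A", "phone": "555-0101", "bank_account": "111"},
    {"name": "Employee B", "address": "34 Jalan B", "phone": "555-0102", "bank_account": "222"},
]
vendors = [
    {"name": "Vendor X", "address": "99 Jalan Z", "phone": "555-0101", "bank_account": "999"},
    {"name": "Vendor Y", "address": "77 Jalan Y", "phone": "555-0199", "bank_account": "888"},
]

print(flag_suspicious_vendors(employees, vendors))
```

A match is only a lead for an auditor to follow up, not proof of fraud: here Vendor X is flagged because it shares a telephone number with Employee A.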
Deviation Analysis: Fraud Detection

https://www.insurancebusinessmag.com/asia/news/breaking-news/malaysias-antifraud-system-operational-by-october-74933.aspx
Profiteering Cases

https://www.freemalaysiatoday.com/category/nation/2018/08/25/yes-keep-receipts-to-fight-profiteering-say-retailers/

Yes, keep receipts to fight profiteering, say retailers
Robin Augustin - August 25, 2018 8:00 AM

http://english.astroawani.com/malaysia-news/gst-1-256-profiteering-cases-detected-1-115-notices-issued-till-june-5-61853
