
Topic 1c - Tasks and Techniques of DM


TOPIC 1 – PART 3

TASKS AND TECHNIQUES OF DATA MINING
OBJECTIVES
To introduce Data Mining (DM) and its relationship with data and knowledge

To discuss the history, evolution and motivation of DM

To discuss DM tasks, techniques, applications ✅ and some major issues
DATA MINING: TASKS and TECHNIQUES
TASKS include:
• Classification
• Clustering
• Association Rule
• Prediction
• Sequential Analysis
• Deviation analysis
• Similarity analysis
• Trend analysis

TECHNIQUES include:
• Decision Trees
• Association Rules
• k-means
• Neural Networks
• Naïve Bayes
• k-nearest neighbor
• Statistical Method

[Figure labels: Knowledge Discovery in Databases, Data mining, Tasks, Techniques]
CLASSIFICATION: DEFINITION
Given a collection of records (training set):
• Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
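The hold-out procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the 30% test fraction, the toy "majority class" learner and the sample records are all assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training_set, test_set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_class(training_set):
    """Toy learner (assumption): the model always predicts the most frequent class."""
    counts = {}
    for _attributes, label in training_set:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Illustrative labelled records: (attributes, class)
records = [
    ({"income": 125}, "No"), ({"income": 100}, "No"), ({"income": 70}, "No"),
    ({"income": 120}, "No"), ({"income": 95}, "Yes"), ({"income": 60}, "No"),
    ({"income": 220}, "No"), ({"income": 85}, "Yes"), ({"income": 75}, "No"),
    ({"income": 90}, "Yes"),
]

training_set, test_set = train_test_split(records)
model = learn_majority_class(training_set)      # built from the training set only
test_accuracy = accuracy(model, test_set)       # validated on the held-out test set
print(f"train={len(training_set)} test={len(test_set)} accuracy={test_accuracy:.2f}")
```

Any real classifier (decision tree, Naïve Bayes, k-nearest neighbor) would replace the toy learner; the split and the accuracy measurement stay the same.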
CLASSIFICATION EXAMPLE
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure labels: Training Set, Learn Classifier, Model, Test Set]
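To make the example concrete, the sketch below encodes the ten training records and the six unlabelled test records in Python and applies a small hand-written decision tree. The tree is an assumption chosen for illustration: it happens to agree with every training record, but it is not necessarily what a learning algorithm would produce, and the 80K income threshold is likewise assumed.

```python
# Training records from the table: (refund, marital_status, taxable_income_k, cheat)
TRAINING_SET = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the table; the class (Cheat) is unknown
TEST_SET = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def classify(refund, marital_status, taxable_income_k):
    """Hand-written decision tree (illustrative assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or divorced, no refund: split on income at an assumed 80K threshold
    return "Yes" if taxable_income_k >= 80 else "No"

# How well does the tree reproduce the known classes of the training records?
training_accuracy = sum(
    classify(refund, status, income) == cheat
    for refund, status, income, cheat in TRAINING_SET
) / len(TRAINING_SET)

# Assign a class to each previously unseen record
predictions = [classify(*record) for record in TEST_SET]
print(training_accuracy, predictions)
```

Because the test records carry no true class, the accuracy on them cannot be measured here; in practice the test set is drawn from labelled data.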
CLASSIFICATION: APPLICATION 1
DIRECT MARKETING

1. Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

2. Approach:
• Collect various demographic, lifestyle, and company-interaction related information: type of business, where they stay, how much they earn, etc.
• Identify which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
• Use this information as input attributes to learn a classifier model.
CLASSIFICATION: APPLICATION 2
CUSTOMER ATTRITION/CHURN

1. Goal: To predict whether a customer is likely to be lost to a competitor.

2. Approach:
• Use detailed records of transactions of past and present customers.
• For example: how often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
CLUSTERING DEFINITION

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.

Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
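As a rough illustration of clustering with Euclidean distance, the following sketch implements a basic k-means loop in plain Python. The 3-D points, the choice of k = 2 and the naive "first k points" initialisation are assumptions made for the example; practical implementations use better initialisation and stopping rules.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, k, max_iterations=100):
    """Basic k-means; initial centroids are simply the first k points (assumption)."""
    centroids = list(points[:k])
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its members (keep it if the cluster is empty)
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical 3-D data points forming two visible groups
points = [
    (1.0, 1.0, 1.0), (1.5, 2.0, 1.0), (1.0, 0.5, 1.5),
    (8.0, 8.0, 8.0), (9.0, 9.5, 8.0), (8.5, 9.0, 9.0),
]
centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Points that end up in the same cluster are close to each other, while the two centroids are far apart, which is the property the definition asks for.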
ILLUSTRATING CLUSTERING
• Euclidean distance based clustering in 3-D space.

[Figure: intracluster distances are minimized; intercluster distances are maximized]
CLUSTERING: APPLICATION 1
MARKET SEGMENTATION

1. Goal: Subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

2. Approach:
• Collect different attributes of customers based on their geographical and
lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in
same cluster vs. those from different clusters.
CLUSTERING: APPLICATION 1 – MARKET SEGMENTATION

Segment 1: high duration but low number of generated calls, and moderate number of sent and received SMS.

Segment 2: moderate duration of generated calls and moderate to high data usage.

Segment 3: high duration of off-net calls, high number of generated calls, and moderate to low duration of generated calls and data usage.

Segment 4: very low call duration, high sent and received SMS, and high data usage.

Segment 5: very low data usage, low duration of generated calls, and high number of received calls with respect to the number of generated calls.

Segment 6: relatively high duration of international calls.

Market Segmentation: https://online-journals.org/index.php/i-jim/article/download/4392/3606


CLUSTERING: APPLICATION 2

DOCUMENT CLUSTERING

1. Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

2. Approach:
• Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.
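One possible similarity measure based on term frequencies is cosine similarity, sketched below. The choice of cosine similarity, the whitespace tokenisation and the three sample documents are assumptions for illustration; the slide does not name a specific measure.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each (lower-cased, whitespace-separated) term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (1.0 = same direction)."""
    dot = sum(tf_a[term] * tf_b[term] for term in set(tf_a) & set(tf_b))
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
documents = [
    "data mining finds patterns in data",
    "mining data for hidden patterns",
    "the recipe needs flour and sugar",
]
vectors = [term_frequencies(doc) for doc in documents]

similar = cosine_similarity(vectors[0], vectors[1])    # share "data", "mining", "patterns"
unrelated = cosine_similarity(vectors[0], vectors[2])  # share no terms
print(f"{similar:.2f} {unrelated:.2f}")
```

A clustering algorithm can then group documents whose pairwise similarity is high, and a new document or search term can be compared against the clusters in the same way.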
ASSOCIATION RULE DISCOVERY: DEFINITION
Given a set of records, each of which contains some number of items from a given collection:
• Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
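The strength of such rules is commonly quantified with support and confidence. The slide does not name these measures, so treat the sketch below as an illustrative assumption; it uses the five transactions from the table and the two discovered rules.

```python
# The five transactions from the table, one set of items per TID
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for items in transactions if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction also containing the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [
    ({"Milk"}, {"Coke"}),
    ({"Diaper", "Milk"}, {"Beer"}),
]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          f"support={support(antecedent, consequent):.2f}",
          f"confidence={confidence(antecedent, consequent):.2f}")
```

For example, Milk appears in four transactions and three of those also contain Coke, so the rule {Milk} --> {Coke} holds in three of the four cases where it applies.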
ASSOCIATION RULE DISCOVERY: APPLICATION 1

MARKETING AND SALES PROMOTION

• Let the rule discovered be {Bagels, … } --> {Potato Chips}
• Potato Chips as consequent: can be used to determine what should be done to boost its sales.
• Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
• Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
ASSOCIATION RULE DISCOVERY: APPLICATION 2

SUPERMARKET SHELF MANAGEMENT

1. Goal: To identify items that are bought together by sufficiently many customers.

2. Approach:
• Process the point-of-sale data collected with barcode scanners to find dependencies among items.
3. A classic rule:
• If a customer buys diapers and milk, then the customer is very likely to buy root beer.
• So, don’t be surprised if you find six-packs of root beer stacked next to diapers!
RETAIL ANALYTICS
https://www.digitalnewsasia.com/download/tapwaycasestudy.pdf
REGRESSION

1. Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
2. Extensively studied in statistics and machine learning.
3. Examples:
• Predicting sales amounts of new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure,
etc.
• Time series prediction of stock market indices.
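For the linear case with a single input variable, the model can be fitted by ordinary least squares, as in the sketch below. The advertising and sales figures are invented for illustration, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Predicted value of the continuous target for a new input x."""
    return slope * x + intercept

# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units)
advertising = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

slope, intercept = fit_line(advertising, sales)
forecast = predict(slope, intercept, 60)  # expected sales for a new expenditure level
print(f"slope={slope:.2f} intercept={intercept:.2f} forecast={forecast:.1f}")
```

A nonlinear dependency would need a different model form (for example a polynomial or a neural network), but the idea of fitting parameters to known values and then predicting unseen ones is the same.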
DEVIATION ANALYSIS

1. Discovering the most significant changes in data from previously measured or normative values
2. Usually categorized separately from other data mining tasks
3. Deviations are often infrequent
4. Modifications of classification, clustering and time series analysis can be used as a means to achieve the goal
5. Corresponds to outlier detection in statistics
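A very simple statistical form of outlier detection flags values that lie far from the mean in units of standard deviation (a z-score test). The sketch below is only one of many possible approaches; the threshold of 2.0 and the sample transaction amounts are assumptions.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]

# Hypothetical daily transaction amounts with one unusual value
amounts = [50, 52, 49, 51, 50, 48, 53, 250]
deviations = find_deviations(amounts)
print(deviations)
```

Note that a large outlier inflates the mean and standard deviation it is compared against, so robust alternatives (for example median-based measures) are often preferred on real data.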
DEVIATION ANALYSIS (ANOMALY DETECTION)

1. Detect significant deviations from normal behavior.
2. Applications:
• Credit Card Fraud Detection
• Network Intrusion Detection

Typical network traffic at university level may reach over 100 million connections per day.
DEVIATION ANALYSIS (FRAUD DETECTION)

1. Identify employee accounts at financial institutions that have excess numbers of credit memos. Excess credit memos can indicate diversion of funds into employee accounts.

2. Compare employee home addresses, social security numbers, telephone numbers, and bank routing and account numbers to those of vendors from the vendor master file. This test can reveal bogus or improperly selected vendor accounts.
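Both tests can be sketched as simple record comparisons. Everything in the example below is hypothetical: the field names, the sample records and the cut-off for "excess" credit memos are assumptions, and a real audit would also need to normalise addresses and phone numbers before comparing them.

```python
from collections import Counter

def accounts_with_excess_credit_memos(credit_memos, max_allowed):
    """Test 1: employee accounts that received more credit memos than allowed."""
    counts = Counter(memo["account"] for memo in credit_memos)
    return sorted(account for account, n in counts.items() if n > max_allowed)

def employee_vendor_matches(employees, vendors,
                            fields=("address", "phone", "bank_account")):
    """Test 2: (employee, vendor, shared fields) for records that share identifying details."""
    matches = []
    for employee in employees:
        for vendor in vendors:
            shared = [
                field for field in fields
                if employee.get(field) and employee.get(field) == vendor.get(field)
            ]
            if shared:
                matches.append((employee["name"], vendor["name"], shared))
    return matches

# Hypothetical data
credit_memos = [
    {"account": "E-001"}, {"account": "E-002"}, {"account": "E-001"},
    {"account": "E-001"}, {"account": "E-003"},
]
employees = [
    {"name": "Employee A", "address": "12 Jalan Satu", "phone": "555-0101", "bank_account": "111"},
    {"name": "Employee B", "address": "34 Jalan Dua", "phone": "555-0102", "bank_account": "222"},
]
vendors = [
    {"name": "Vendor X", "address": "99 Jalan Tiga", "phone": "555-0199", "bank_account": "999"},
    {"name": "Vendor Y", "address": "34 Jalan Dua", "phone": "555-0150", "bank_account": "222"},
]

flagged_accounts = accounts_with_excess_credit_memos(credit_memos, max_allowed=2)
suspicious_pairs = employee_vendor_matches(employees, vendors)
print(flagged_accounts, suspicious_pairs)
```

A match is only a lead for further investigation, not proof of fraud.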
DEVIATION ANALYSIS (FRAUD DETECTION)

https://www.insurancebusinessmag.com/asia/news/breaking-news/malaysias-antifraud-system-operational-by-october-74933.aspx
PROFITEERING CASES

https://www.freemalaysiatoday.com/category/nation/2018/08/25/yes-keep-receipts-to-fight-profiteering-say-retailers/

Yes, keep receipts to fight profiteering, say retailers
Robin Augustin - August 25, 2018 8:00 AM

http://english.astroawani.com/malaysia-news/gst-1-256-profiteering-cases-detected-1-115-notices-issued-till-june-5-61853
THANK YOU
Shuzlina Abdul Rahman | Sofianita Mutalib | Siti Nur Kamaliah Kamarudin
