Introduction To Data Mining

This document provides an introduction and overview of a course on data mining. The course will cover topics such as data preprocessing, mining association rules, clustering algorithms, classification, and web/social network analysis. It lists recommended textbooks and readings. The document outlines pre-requisites for the course, including a background in databases, algorithms, and programming. It defines data mining as the process of discovering hidden patterns from large data sets to extract useful information.

Uploaded by

Muhammad Ramzan
Copyright
© All Rights Reserved
Introduction to Data Mining

Afzaal Hussain

Email: [email protected]
Course Content
• Introduction to data mining
• Data Preprocessing (data cleaning, data integration, data reduction, concept hierarchies)
• Mining Association Rules (frequent itemsets and association rules)
• Clustering Algorithms (partitioning methods, hierarchical methods, density-based methods)
• Classification
• Web Mining / Social Network Analysis


Textbooks and Readings
• Text
  – Introduction to Data Mining. By P.-N. Tan, M. Steinbach and V. Kumar.
  – Data Mining: Concepts and Techniques. By Jiawei Han and Micheline Kamber.
  – Selected Research Papers
• Supplementary Material
  – Data Mining: Practical Machine Learning Tools and Techniques. By I. H. Witten and E. Frank, Morgan Kaufmann.
  – Mining of Massive Datasets. By Anand Rajaraman, Jure Leskovec and Jeff Ullman.
• Some textbooks are free to download

Pre-Requisites
• Students should have a good background in
  – Database Systems
  – Algorithms and data structures
  – Programming
What is Data Mining?

Knowledge discovery from data


Introduction
• Data is growing at a phenomenal rate
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/credit card transactions
  – Scientific simulations

UNCOVER HIDDEN INFORMATION: DATA MINING
• Data contains value and knowledge
• Information “hidden” in the data is not readily evident
• Human analysts take weeks to discover useful information
What is Data Mining
• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is Data Mining
• Given lots of data
• Discover patterns/models that are:
  – Valid: hold on new data with some certainty
  – Useful: should be possible to act on the item
  – Unexpected: non-obvious to the system
  – Understandable: humans should be able to interpret the pattern
Alternative names
Data Mining is also known as:
• Information Harvesting
• Knowledge Mining
• Knowledge Discovery in Databases
• Data Dredging
• Data Pattern Processing
• Data Archaeology
• Database Mining
• Knowledge Extraction

People you may know
An algorithm that could cause a lot of grief
Meaningfulness of Analytical results
• Risk involved in Data Mining
  – An analyst can “discover” patterns that are meaningless
• Statisticians call it Bonferroni’s principle
  – If you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
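Bonferroni’s principle can be made concrete with a small experiment. The sketch below is only an illustration under assumed numbers (20 records, 5000 candidate “predictors”, a seed of 0, and the function names are all hypothetical): every predictor is a pure coin flip, yet when enough of them are tried, some will agree with the labels on most records purely by chance.

```python
import random
from math import comb


def chance_of_match(n_records: int, min_hits: int) -> float:
    """Probability that a coin-flip predictor agrees with the labels
    on at least `min_hits` of `n_records` records."""
    favourable = sum(comb(n_records, k) for k in range(min_hits, n_records + 1))
    return favourable / 2 ** n_records


def best_random_predictor(n_records: int, n_predictors: int, seed: int = 0) -> float:
    """Try many random predictors on random labels; return the best accuracy seen."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_records)]
    best = 0.0
    for _ in range(n_predictors):
        guess = [rng.randint(0, 1) for _ in range(n_records)]
        accuracy = sum(g == y for g, y in zip(guess, labels)) / n_records
        best = max(best, accuracy)
    return best


# Assumed toy sizes: 20 records, 5000 meaningless predictors.
p = chance_of_match(20, 16)            # chance one predictor reaches >= 80% accuracy
expected_false_finds = 5000 * p        # expected number of "discoveries" by luck alone
best = best_random_predictor(20, 5000)
print(f"p = {p:.4f}, expected spurious finds = {expected_false_finds:.1f}, best = {best:.2f}")
```

The point of the design is that nothing here is learnable: any predictor that looks good is an artifact of having searched in too many places for the amount of data available.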
Meaningfulness of Analytical results
• Suggested approach:
  – Human-centered, query-based, focused mining
• How to measure?
  – Interestingness
Interestingness
• Objective:
  – based on statistics and structures of patterns, e.g. support, confidence, etc.
• Subjective:
  – based on the user’s beliefs in the data, e.g. unexpectedness, novelty, etc.

Interestingness measures:
• easily understood by humans
• valid on new or test data with some degree of certainty
• potentially useful
• novel, or validates some hypothesis that a user seeks to confirm
Data Mining and related Disciplines
• Data mining overlaps with:
  – Databases (DB): large-scale data, simple queries
  – Machine learning (ML): small data, complex models
  – CS theory: (randomized) algorithms
• Different cultures:
  – To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
    – Result is the query answer
  – To an ML person, data mining is the inference of models
    – Result is the parameters of the model
Data Mining and related Disciplines
• Emphasis is on
  – scalability in the number of features and instances (big data)
  – algorithms and architectures
  – whereas the foundations of methods and formulations are provided by statistics and machine learning
  – automation for handling large, complex and heterogeneous data
Database vs Data Mining
• Database
  – Find all credit applicants with last name of Smith.
  – Identify customers who have purchased more than $10,000 in the last month.
  – Find all customers who have purchased milk.

• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)
Database Processing vs. Data Mining Processing
• Database processing
  – Query: well defined, SQL
  – Output: precise, subset of database
• Data mining processing
  – Query: poorly defined, no precise query language
  – Output: fuzzy, not a subset of database
What is Data Mining?
• What is not DM?
  – Look up phone number in phone directory
  – Query a Web search engine for information about “Amazon”
• What is DM?
  – Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
  – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Applications
• Commercial applications
  – Classification of debt inquiries
  – Segmentation of customer groups
  – Churn analysis
• Scientific applications
  – Astronomy
  – Medical research
  – Medical diagnostics
Applications
• Banking: loan/credit card approval
  – predict good customers based on old customers
• Customer relationship management:
  – identify those who are likely to leave for a competitor
• Targeted marketing:
  – identify likely responders to promotions
• Fraud detection: telecommunications, finance
  – from an online stream of events, identify fraudulent events
Applications
• Medicine: disease outcome, effectiveness of treatments
  – analyze patient disease history: find relationships between diseases
• Molecular/Pharmaceutical:
  – identify new drugs
• Scientific data analysis:
  – identify new galaxies by searching for sub-clusters
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD):
  – process of finding useful information and patterns in data.
• Data Mining:
  – use of algorithms to extract the information and patterns derived by the KDD process.
Knowledge Discovery in Databases: Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD process. Data → Cleaning and Integration → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

KDD Process Ex: Web Log
• Selection:
  – Select log data (dates and locations) to use
• Preprocessing:
  – Remove identifying URLs
  – Remove error logs
• Transformation:
  – Sessionize (sort and group)
• Data Mining:
  – Identify and count patterns
  – Construct data structure
• Interpretation/Evaluation:
  – Identify and display frequently accessed sequences.
• Potential User Applications:
  – Cache prediction
  – Personalization
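The web log steps above can be sketched as a small pipeline. This is a minimal illustration, not a reference implementation: the log records, the 30-minute session gap, and all function names are assumptions made for the example, and the “remove identifying URLs” step is left out because the slides do not say what counts as identifying.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, minute, url, http_status)
LOG = [
    ("u1", 1, "/home", 200), ("u1", 2, "/products", 200), ("u1", 3, "/cart", 200),
    ("u2", 1, "/home", 200), ("u2", 2, "/products", 200), ("u2", 4, "/missing", 404),
    ("u3", 5, "/home", 200), ("u3", 6, "/products", 200), ("u3", 90, "/home", 200),
]


def select(log, start, end):
    """Selection: keep only records inside the time window of interest."""
    return [r for r in log if start <= r[1] <= end]


def preprocess(log):
    """Preprocessing: drop error responses (status 400 and above)."""
    return [r for r in log if r[3] < 400]


def sessionize(log, gap=30):
    """Transformation: sort by user and time, then split each user's visits
    into sessions whenever two requests are more than `gap` minutes apart."""
    by_user = defaultdict(list)
    for user, minute, url, _ in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((minute, url))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_t, _), (t, url) in zip(visits, visits[1:]):
            if t - prev_t > gap:
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(preprocess(select(LOG, 0, 60)))
pair_counts = count_pairs(sessions)
# Interpretation/Evaluation: display the most frequently accessed sequence.
print(pair_counts.most_common(1))
```

Counting consecutive pairs is the simplest possible pattern; a cache predictor or personalization feature would typically look at longer sequences.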
Data Mining Tasks
• Descriptive methods (unsupervised)
  – Find human-interpretable patterns that describe the data
  – Example: Clustering

• Predictive methods (supervised)
  – Use some variables to predict unknown or future values of other variables
  – Example: Recommender systems
Data Mining Models and Tasks
• Descriptive data mining:
  – Describe general properties
• Predictive data mining:
  – Infer on available data
Classification
• Classification maps data into predefined groups or classes based on attribute values. (supervised classification)
  – classify students based on final result,
  – classify countries based on climate, or
  – classify cars based on gas mileage

• Goal:
  – unseen records should be assigned a class as accurately as possible.
Classification Example

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Flow: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.
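To make the train-then-apply flow concrete, the sketch below encodes the training and test records from the example and applies a small hand-written decision tree. The tree is an assumption for illustration: it was written by hand to be consistent with the ten training records (the 80K income threshold is one of several values that would separate them), and it is not the output of any particular learning algorithm.

```python
# Training set from the example: (refund, marital_status, taxable_income_k, cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Test set from the example: the class label is unknown.
TEST = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]


def classify(refund: str, marital_status: str, income_k: int) -> str:
    """Hand-written decision tree (an assumed model, not a learned one)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumed threshold: any value above 70 and at most 85 fits the training data.
    return "Yes" if income_k >= 80 else "No"


# Check the model against the training set, then label the unseen records.
train_accuracy = sum(classify(r, m, i) == y for r, m, i, y in TRAIN) / len(TRAIN)
predictions = [classify(*row) for row in TEST]
print(train_accuracy, predictions)
```

Perfect accuracy on the training set says nothing about accuracy on unseen records, which is why the goal is stated in terms of the test set.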
Classification
• Typical methods
  – Decision trees,
  – naïve Bayesian classification,
  – support vector machines,
  – neural networks,
  – rule-based or pattern-based classification,
  – logistic regression, …
• Typical applications:
  – Credit card fraud detection,
  – direct marketing,
  – classifying stars, diseases, web pages, …
Classification Application
• Direct Marketing
  – Goal: reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  – Approach:
    – Use the data for a similar product introduced before.
    – We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    – Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    – Use this information as input attributes to learn a classifier model.
Clustering
• Clustering groups similar data together into clusters based on attribute values. (unsupervised classification)

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.

• Similarity Measures:
  – Euclidean distance if attributes are continuous.
  – Other problem-specific measures.
Illustrating Clustering
Euclidean distance based clustering in 3-D space.
• Intracluster distances are minimized
• Intercluster distances are maximized
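A minimal sketch of Euclidean distance based clustering in 3-D follows. It uses k-means, which is one partitioning method among several and is chosen here only for illustration; the six points, the choice of starting centroids, and the function names are assumptions for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until assignments stop changing."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def mean_pair_distance(points, labels, same_cluster):
    """Average distance over point pairs inside one cluster (same_cluster=True)
    or across different clusters (same_cluster=False)."""
    distances = [
        dist(p, q)
        for (p, lp), (q, lq) in combinations(zip(points, labels), 2)
        if (lp == lq) == same_cluster
    ]
    return sum(distances) / len(distances)


# Assumed toy data: two well-separated groups in 3-D space.
POINTS = [(1, 1, 1), (1.5, 2, 1), (1, 0.5, 1.5), (8, 8, 8), (9, 9, 8.5), (8.5, 9.5, 9)]
labels, centroids = kmeans(POINTS, [POINTS[0], POINTS[3]])
intra = mean_pair_distance(POINTS, labels, same_cluster=True)
inter = mean_pair_distance(POINTS, labels, same_cluster=False)
print(labels, round(intra, 2), round(inter, 2))
```

A good clustering shows a small intracluster average and a large intercluster average, which is what the two helper calls compare.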
Clustering Application: Market Segmentation
• Goal: subdivide a market into distinct subsets of customers
  – where any subset may be selected as a market target to be reached with a distinct marketing mix.

• Approach:
  – Collect different attributes of customers based on their geographical and lifestyle related information.
  – Find clusters of similar customers.
  – Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering: Application 2
• Document Clustering:
  – Goal:
    – To find groups of documents that are similar to each other based on the important terms appearing in them.
  – Approach:
    – To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  – Gain:
    – Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
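One common way to turn term frequencies into a similarity measure is cosine similarity between term-count vectors. The slides do not name a specific measure, so this choice, the three toy documents, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def term_frequencies(text: str) -> Counter:
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Assumed toy documents: two about the rainforest, one about the retailer.
doc_a = term_frequencies("amazon rainforest trees river")
doc_b = term_frequencies("rainforest river trees wildlife")
doc_c = term_frequencies("amazon online shopping store")

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Documents a and b share three of their four terms and score higher than a and c, which share only the word "amazon". A clustering algorithm can use such scores to keep the two senses of “Amazon” in separate groups.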
Association Rule Discovery
• Frequent patterns (or frequent itemsets)
  – What items are frequently purchased together in your Walmart?
• Produce dependency rules which will predict the occurrence of an item based on occurrences of other items in the data.

TID  Items
1    Bread, Coke, Milk
2    Cereal, Bread
3    Cereal, Coke, Diaper, Milk
4    Cereal, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Cereal}
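Support and confidence, the objective interestingness measures mentioned earlier, can be computed directly for the two rules on the five transactions. This is a minimal sketch: the function names are illustrative, and it only evaluates given rules rather than searching for frequent itemsets.

```python
# The five transactions from the table.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Cereal", "Bread"},
    {"Cereal", "Coke", "Diaper", "Milk"},
    {"Cereal", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset: set) -> int:
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= transaction for transaction in TRANSACTIONS)


def support(antecedent: set, consequent: set) -> float:
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(TRANSACTIONS)


def confidence(antecedent: set, consequent: set) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


milk_coke = (support({"Milk"}, {"Coke"}), confidence({"Milk"}, {"Coke"}))
diaper_milk_cereal = (
    support({"Diaper", "Milk"}, {"Cereal"}),
    confidence({"Diaper", "Milk"}, {"Cereal"}),
)
print(milk_coke)           # support 3/5, confidence 3/4
print(diaper_milk_cereal)  # support 2/5, confidence 2/3
```

Milk appears in four transactions and Coke accompanies it in three of them, which gives the first rule its confidence of 3/4.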
Association Rule Discovery: Application
• Marketing and Sales Promotion:
  – Let the rule discovered be {Bagels, … } --> {Potato Chips}
  – Potato Chips as consequent =>
    – Can be used to determine what should be done to boost its sales.
  – Bagels in the antecedent =>
    – Can be used to see which products would be affected if the store discontinues selling bagels.
  – Bagels in antecedent and Potato Chips in consequent =>
    – Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
Outlier Analysis / Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
  – Credit Card Fraud Detection
  – Network Intrusion Detection
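As one simple illustration of detecting deviations from normal behavior, the sketch below flags values whose modified z-score, based on the median and the median absolute deviation (MAD), exceeds a threshold. The technique, the 3.5 threshold, the card amounts, and the function name are assumptions chosen for the example; real fraud or intrusion detection uses far richer features.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.
    The median and MAD are used instead of mean and standard deviation
    because a single extreme value distorts them much less."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # all values (nearly) identical: nothing to flag
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical card transaction amounts with one unusual purchase.
AMOUNTS = [20, 25, 22, 19, 24, 21, 23, 500]
flagged = mad_outliers(AMOUNTS)
print(flagged)
```

On a sample this small, a plain mean-and-standard-deviation z-score would fail to flag the 500 purchase at a threshold of 3, because the outlier itself inflates the standard deviation; that is the reason for the robust variant here.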
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
