
Introduction To Data Mining

This document provides an introduction and overview of a course on data mining. The course will cover topics such as data preprocessing, mining association rules, clustering algorithms, classification, and web/social network analysis. It lists recommended textbooks and readings. The document outlines pre-requisites for the course, including a background in databases, algorithms, and programming. It defines data mining as the process of discovering hidden patterns from large data sets to extract useful information.

Uploaded by

Muhammad Ramzan

Introduction to Data Mining

Afzaal Hussain

Email: [email protected]

Course Content
• Introduction to data mining
• Data preprocessing (data cleaning, data integration, data reduction, concept hierarchies)
• Mining association rules (frequent itemsets and association rules)
• Clustering algorithms (partitioning methods, hierarchical methods, density-based methods)
• Classification
• Web mining / social network analysis

Textbooks and Readings
• Text
  • Introduction to Data Mining. By P.-N. Tan, M. Steinbach and V. Kumar.
  • Data Mining: Concepts and Techniques. By Jiawei Han and Micheline Kamber.
• Selected research papers
• Supplementary material
  • Data Mining: Practical Machine Learning Tools and Techniques. By I. H. Witten and E. Frank, Morgan Kaufmann.
  • Mining of Massive Datasets. By Anand Rajaraman, Jure Leskovec and Jeff Ullman.
• Some textbooks are free to download

Pre-Requisites
• Students should have a good background in:
  • Database systems
  • Algorithms and data structures
  • Programming
What is Data Mining?

Knowledge discovery from data

Introduction
• Data is growing at a phenomenal rate
  • Web data, e-commerce
  • Purchases at department/grocery stores
  • Bank/credit card transactions
  • Scientific simulations

Data Mining: Uncover Hidden Information
• Data contains value and knowledge
• Information “hidden” in the data is not readily evident
• Human analysts take weeks to discover useful information

What is Data Mining
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  • Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is Data Mining
• Given lots of data
• Discover patterns/models that are:
  • Valid: hold on new data with some certainty
  • Useful: it should be possible to act on the item
  • Unexpected: non-obvious to the system
  • Understandable: humans should be able to interpret the pattern

Alternative names for data mining
• Information Harvesting
• Knowledge Mining
• Knowledge Discovery in Databases
• Data Dredging
• Data Pattern Processing
• Data Archaeology
• Database Mining
• Knowledge Extraction

People you may know
An algorithm that could cause a lot of grief

Meaningfulness of Analytical Results
• A risk involved in data mining is that an analyst can “discover” patterns that are meaningless
• Statisticians call it Bonferroni’s principle:
  • If you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
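
One way to make this risk concrete is to count how many patterns would pass a significance test by chance alone. The sketch below assumes independent tests, each with false-positive rate `alpha`; the function names and the numbers are illustrative and do not come from the slides.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of chance 'discoveries' when nothing real is present."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha):
    """Chance of at least one spurious hit, assuming independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / num_tests


if __name__ == "__main__":
    m, alpha = 1000, 0.05  # hypothetical: 1000 candidate patterns, 5% level
    print(expected_false_positives(m, alpha))
    print(prob_at_least_one_false_positive(100, alpha))
    print(bonferroni_threshold(m, alpha))
```

Looking in 1000 places at the 5% level yields about 50 spurious patterns on average, which is why the number of hypotheses examined has to be weighed against the amount of data.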

Meaningfulness of Analytical Results
• Suggested approach:
  • Human-centered, query-based, focused mining
• How to measure?
  • Interestingness

Interestingness
• Objective:
  • Based on statistics and structures of patterns, e.g. support, confidence, etc.
• Subjective:
  • Based on the user’s beliefs in the data, e.g. unexpectedness, novelty, etc.
• Interestingness measures. A pattern is interesting if it is:
  • Easily understood by humans
  • Valid on new or test data with some degree of certainty
  • Potentially useful
  • Novel, or validates some hypothesis that a user seeks to confirm

Data Mining and Related Disciplines
• Data mining overlaps with:
  • Databases (DB): large-scale data, simple queries
  • Machine learning (ML): small data, complex models
  • CS theory: (randomized) algorithms
• Different cultures:
  • To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
    • Result is the query answer
  • To an ML person, data mining is the inference of models
    • Result is the parameters of the model

Data Mining and Related Disciplines
• Emphasis is on:
  • Scalability in the number of features and instances (big data)
  • Algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  • Automation for handling large, complex and heterogeneous data

Database vs Data Mining
• Database
  • Find all credit applicants with the last name Smith.
  • Identify customers who have purchased more than $10,000 in the last month.
  • Find all customers who have purchased milk.
• Data Mining
  • Find all credit applicants who are poor credit risks. (classification)
  • Identify customers with similar buying habits. (clustering)
  • Find all items which are frequently purchased with milk. (association rules)

Database Processing vs. Data Mining Processing
• Database processing
  • Query: well defined, expressed in SQL
  • Output: precise, a subset of the database
• Data mining processing
  • Query: poorly defined, no precise query language
  • Output: fuzzy, not a subset of the database

What is Data Mining?
• What is not data mining?
  • Looking up a phone number in a phone directory
  • Querying a Web search engine for information about “Amazon”
• What is data mining?
  • Finding that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
  • Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Applications
• Commercial applications
  • Classification of debt inquiries
  • Segmentation of customer groups
  • Churn analysis
• Scientific applications
  • Astronomy
  • Medical research
  • Medical diagnostics

Applications
• Banking: loan/credit card approval
  • Predict good customers based on old customers
• Customer relationship management
  • Identify those who are likely to leave for a competitor
• Targeted marketing
  • Identify likely responders to promotions
• Fraud detection: telecommunications, finance
  • From an online stream of events, identify fraudulent events

Applications
• Medicine: disease outcome, effectiveness of treatments
  • Analyze patient disease history: find relationships between diseases
• Molecular/pharmaceutical
  • Identify new drugs
• Scientific data analysis
  • Identify new galaxies by searching for sub-clusters

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD):
  • Process of finding useful information and patterns in data.
• Data Mining:
  • Use of algorithms to extract the information and patterns derived by the KDD process.

Knowledge Discovery in Databases: Process
• Data mining is the core of the knowledge discovery process.
[Figure: the KDD process, with stages labeled Cleaning and Integration, Selection (Target Data), Preprocessing (Preprocessed Data), Data Mining (Patterns), and Interpretation/Evaluation (Knowledge).]

KDD Process Example: Web Log
• Selection:
  • Select log data (dates and locations) to use
• Preprocessing:
  • Remove identifying URLs
  • Remove error logs
• Transformation:
  • Sessionize (sort and group)
• Data Mining:
  • Identify and count patterns
  • Construct data structure
• Interpretation/Evaluation:
  • Identify and display frequently accessed sequences
• Potential User Applications:
  • Cache prediction
  • Personalization
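
The web log steps above can be sketched in a few lines of Python. This is a minimal illustration only: the record layout `(user, timestamp, url, status)`, the sample log, and the 30-minute session timeout are all assumptions, and the removal of identifying URLs is left out.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, url, HTTP status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 45, "/missing", 404),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 50, "/products", 200),
    ("u1", 5000, "/home", 200),
    ("u1", 5020, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity timeout (seconds) that ends a session


def sessionize(records, gap=SESSION_GAP):
    """Drop error entries, then sort and group page views into sessions."""
    by_user = defaultdict(list)
    for user, ts, url, status in records:
        if status < 400:  # preprocessing: remove error logs
            by_user[user].append((ts, url))
    sessions = []
    for user in sorted(by_user):
        views = sorted(by_user[user])  # transformation: sort by time
        current = [views[0][1]]
        for (prev_ts, _), (ts, url) in zip(views, views[1:]):
            if ts - prev_ts > gap:  # long pause: start a new session
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def count_sequences(sessions, length=2):
    """Data mining step: count consecutive page sequences of a given length."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts


sessions = sessionize(LOG)
pair_counts = count_sequences(sessions)
print(pair_counts.most_common(1))  # most frequently accessed sequence
```

A cache could prefetch the page that most often follows the current one, which is the "cache prediction" application named above.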

Data Mining Tasks
• Descriptive methods (unsupervised)
  • Find human-interpretable patterns that describe the data
  • Example: clustering
• Predictive methods (supervised)
  • Use some variables to predict unknown or future values of other variables
  • Example: recommender systems

Data Mining Models and Tasks
• Descriptive data mining:
  • Describe general properties of the data
• Predictive data mining:
  • Perform inference on the available data

Classification
• Classification maps data into predefined groups or classes based on attribute values (supervised classification), e.g.:
  • Classify students based on final result
  • Classify countries based on climate
  • Classify cars based on gas mileage
• Goal:
  • Unseen records should be assigned a class as accurately as possible.

Classification Example

Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

The training set is used to learn a classifier (the model), which is then applied to the test set.
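
To show what "applying a model to unseen records" means, the sketch below encodes one decision tree that happens to fit the training table and runs it on the test rows. The tree is hand-written for illustration (the 80K income threshold in particular is an assumption); a real classifier would induce such rules from the training set automatically.

```python
# Training records from the table: (refund, marital_status, income_k, cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records: the class label is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def predict(refund, marital_status, income_k):
    """Hand-written decision tree (illustrative, not learned by an algorithm)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumed split point: any threshold above 70K and up to 85K fits the table.
    return "Yes" if income_k >= 80 else "No"


def accuracy(records):
    """Fraction of labeled records the tree classifies correctly."""
    correct = sum(predict(r, m, i) == label for r, m, i, label in records)
    return correct / len(records)


predictions = [predict(*row) for row in TEST]
print(accuracy(TRAINING), predictions)
```

Perfect accuracy on the training set says little by itself; the goal stated earlier is accuracy on unseen records, which can only be checked once their true labels are known.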

Classification
• Typical methods:
  • Decision trees
  • Naïve Bayesian classification
  • Support vector machines
  • Neural networks
  • Rule-based or pattern-based classification
  • Logistic regression, …
• Typical applications:
  • Credit card fraud detection
  • Direct marketing
  • Classifying stars, diseases, web pages, …

Classification Application
• Direct Marketing
  • Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  • Approach:
    • Use the data for a similar product introduced before.
    • We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    • Use this information as input attributes to learn a classifier model.

Clustering
• Clustering groups similar data together into clusters based on attribute values (unsupervised classification).
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  • Data points in one cluster are more similar to one another.
  • Data points in separate clusters are less similar to one another.
• Similarity measures:
  • Euclidean distance if attributes are continuous.
  • Other problem-specific measures.

Illustrating Clustering
• Euclidean distance based clustering in 3-D space:
  • Intracluster distances are minimized
  • Intercluster distances are maximized
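
A small sketch of Euclidean distance based clustering in 3-D follows. It uses k-means, which is one partitioning method among several and is chosen here as an assumption; the sample points and the starting centroids are made up for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical 3-D data with two visibly separate groups.
POINTS = [(1.0, 1.0, 1.0), (1.2, 0.8, 1.1), (0.9, 1.1, 0.9),
          (8.0, 8.0, 8.0), (8.2, 7.9, 8.1), (7.8, 8.1, 7.9)]

centroids, clusters = kmeans(POINTS, centroids=[POINTS[0], POINTS[3]])

# Largest distance inside a cluster vs. smallest distance across clusters.
max_intra = max(euclidean(p, q) for c in clusters for p in c for q in c)
min_inter = min(euclidean(p, q) for p in clusters[0] for q in clusters[1])
print(clusters, max_intra, min_inter)
```

On this data the largest intracluster distance is far smaller than the smallest intercluster distance, which is the property the slide describes.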

Clustering Application: Market Segmentation
• Goal: subdivide a market into distinct subsets of customers, where any subset may be selected as a market target to be reached with a distinct marketing mix.
• Approach:
  • Collect different attributes of customers based on their geographical and lifestyle related information.
  • Find clusters of similar customers.
  • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2
• Document Clustering:
  • Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
  • Approach: identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  • Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
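
A similarity measure based on term frequencies can be sketched as follows. Cosine similarity is used here as one common choice, not necessarily the measure the course adopts, and the three short documents are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents returned for the query "Amazon".
DOCS = {
    "rainforest": "amazon rainforest river trees rainforest wildlife",
    "river": "amazon river basin rainforest trees",
    "retail": "amazon online store books shopping store",
}

tf = {name: term_frequencies(text) for name, text in DOCS.items()}
sim_nature = cosine_similarity(tf["rainforest"], tf["river"])
sim_mixed = cosine_similarity(tf["rainforest"], tf["retail"])
print(sim_nature, sim_mixed)
```

The two nature documents score much higher with each other than either does with the retail document, so a clustering step built on this measure would separate the two senses of "Amazon".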

Association Rule Discovery
• Frequent patterns (or frequent itemsets)
  • What items are frequently purchased together in your Walmart?
• Produce dependency rules which will predict the occurrence of an item based on occurrences of other items in the data.

TID  Items
1    Bread, Coke, Milk
2    Cereal, Bread
3    Cereal, Coke, Diaper, Milk
4    Cereal, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Cereal}
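
The two rules can be checked against the five transactions using support and confidence, the objective measures named earlier in these slides. The sketch below uses the standard definitions; the function names are illustrative.

```python
# The five market-basket transactions from the table, one set per TID.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Cereal", "Bread"},
    {"Cereal", "Coke", "Diaper", "Milk"},
    {"Cereal", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Among transactions with the antecedent, fraction that also have the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


print(support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))
print(support({"Diaper", "Milk", "Cereal"}),
      confidence({"Diaper", "Milk"}, {"Cereal"}))
```

{Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk (confidence 0.75), and {Diaper, Milk} --> {Cereal} holds in 2 of 3 (confidence about 0.67).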

Association Rule Discovery: Application
• Marketing and Sales Promotion:
  • Let the rule discovered be {Bagels, …} --> {Potato Chips}
  • Potato Chips as consequent => can be used to determine what should be done to boost its sales.
  • Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
  • Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!

Outlier Analysis / Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
  • Credit card fraud detection
  • Network intrusion detection
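
A very simple way to flag "significant deviations from normal behavior" is a z-score test, sketched below. The technique, the threshold of 2.0, and the transaction amounts are all assumptions made for illustration; practical fraud and intrusion detection systems use far richer models.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts; one is far from the rest.
AMOUNTS = [20, 25, 22, 19, 24, 21, 23, 500]

outliers = zscore_outliers(AMOUNTS)
print(outliers)
```

Because a single extreme value also inflates the mean and standard deviation, this test becomes less sensitive as outliers grow more numerous, so robust statistics such as the median are often preferred.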

Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and heterogeneous data
• Data quality
• Data ownership and distribution
• Privacy preservation
• Streaming data
