DM Lec1

The document discusses data mining and provides definitions and examples of key concepts like association rules, clustering, and classification. It defines data mining as the extraction of useful patterns from large amounts of data and explains why it is needed due to the huge growth of digital data.

Uploaded by

هارون المقطري


DATA MINING

Introduction
Lec 1

Mohammed
What is data mining?
• After years of data mining there is still no unique
answer to this question.

• A tentative definition:

Data mining is the use of efficient techniques for
the analysis of very large collections of data and the
extraction of useful and possibly unexpected
patterns in data.
What is data mining?
• Data mining (knowledge discovery in databases):
• Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) information or patterns from data in large
databases

• Alternative names and their “inside stories”:

• Data mining: a misnomer?
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
What is (not) Data Mining?
● What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about
“Amazon”
Database vs Data Mining
• Database
• Find all credit applicants with last name of Smith.
• Identify customers who have purchased more than $10,000
in the last month.
• Find all customers who have purchased milk

• Data Mining
• Find all credit applicants who are poor credit risks.
(classification)
• Identify customers with similar buying habits. (clustering)
• Find all items which are frequently purchased with milk.
(association rules)
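
The contrast above can be sketched in a few lines of Python. This is only an illustration on hypothetical toy data: the transactions and the function names `customers_who_bought` and `items_bought_with` are assumptions, not part of the lecture. The first function is a database-style lookup with a fixed condition; the second asks a mining-style question whose answer is a pattern in the data.

```python
from collections import Counter

# Hypothetical toy data: each transaction is the set of items one customer bought.
transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def customers_who_bought(item, transactions):
    """Database-style query: return the indices of transactions containing item."""
    return [i for i, t in enumerate(transactions) if item in t]

def items_bought_with(item, transactions):
    """Mining-style question: count which other items co-occur with item."""
    counts = Counter()
    for t in transactions:
        if item in t:
            counts.update(t - {item})
    return counts.most_common()

print(customers_who_bought("milk", transactions))
print(items_bought_with("milk", transactions))
```

The query returns exactly the rows that match the condition, while the co-occurrence count ranks candidate patterns that nobody specified in advance.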
A Bit of History
• We are drowning in data, but starving for knowledge.
(John Naisbitt, 1982)

• It has been estimated that the amount of information
in the world doubles every 20 months.
(Frawley, Piatetsky-Shapiro, Matheus, 1992)
We are Drowning in Data...

Wikipedia (en, text only)

≈ 20 GB of data

James Webb
Telescope
≈57 GB/day
≈21 TB/year
We are Drowning in Data...
Facebook
≈12 TB/day added
(as of Mar. 2010)

Google
≈20 PB/day processed
(Jan. 2010)
We are Drowning in Data...
...but starving for knowledge!

Rate at which data are produced

Rate at which data can be understood

manual interpretation is hardly feasible!
Motivation:
• Data explosion problem
• Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories

• Solution: Data warehousing and data mining
• Data warehousing and on-line analytical processing
• Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases

Why Mine Data?
• Commercial Viewpoint:
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• Purchases at department/
grocery stores
• Bank/credit card
transactions

• Scientific Viewpoint:
• Data collected and stored at
enormous speeds (GB/hour)
• Remote sensors on a satellite
• Telescopes scanning the skies
• Microarrays generating gene
expression data
• Scientific simulations
generating terabytes of data
Data is power!
• “The data is the computer”
• Large amounts of data can be more powerful than
complex algorithms and models
• Google has solved many Natural Language Processing problems,
simply by looking at the data
• Example: misspellings, synonyms
• Data is power!
• Today, the collected data is one of the biggest assets of an
online company
• Query logs of Google
• The friendship and updates of Facebook
• Tweets and follows of Twitter
• Amazon transactions
Data is power!
• Competitive Pressure is Strong
• Provide better, customized services for an edge (e.g. in Customer
Relationship Management)

• Traditional techniques infeasible for raw data

• Data mining may help scientists
• Really, really huge amounts of raw data!
• In the digital age, terabytes of data are generated
every second
• Mobile devices, digital photographs, web
documents
• Facebook updates, Tweets, blogs, user-generated
content
• Transactions, sensor data, surveillance data
• Queries, clicks, browsing
• Cheap storage has made it possible to
maintain this data
Data, Information, Knowledge, and
Wisdom
Example (Cholera)
• Cholera disease
• From beginning of 19th century
• ~100,000 deaths per year
– until today!
• For a long time,
there was little knowledge
– on ways of infection
– on causes of the disease
Example (CoViD-19)
• Data mining can help us understand
– pathways and chains of infection
– critical preconditions of patients
• previous diseases
• medications
• genetic preconditions
– effectiveness of prevention strategies
What is Data Mining again?
• “Data mining is the discovery of models for data”
(Rajaraman, Ullman)
Origins of Data Mining
• Draws ideas from machine learning, statistics, and
database systems.
• Traditional techniques may be unsuitable due to
– large amount of data
– high dimensionality of data
– heterogeneous, distributed nature of data
Data Mining: Classification Schemes
• Decisions in data mining
• Kinds of databases to be mined
• Kinds of knowledge to be discovered
• Kinds of techniques utilized
• Kinds of applications adapted

• Data mining tasks
• Descriptive data mining
• Predictive data mining

Decisions in Data Mining
• Databases to be mined
• Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy,
WWW, etc.
• Knowledge to be mined
• Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, DNA mining, stock
market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
• Prediction Tasks
• Use some variables to predict unknown or future values of other
variables
• Description Tasks
• Find human-interpretable patterns that describe the data.

Common data mining tasks

• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
Data Mining Models and Tasks
ASSOCIATION RULES
Frequent Itemsets and Association
Rules
• Given a set of records, each of which contains some number
of items from a given collection:
• Identify sets of items (itemsets) that frequently occur
together
• Produce dependency rules that predict the occurrence
of an item based on the occurrences of other items.

Itemsets Discovered:
{Milk,Coke}
{Diaper, Milk}

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
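
A minimal sketch of how such itemsets and rules could be scored, assuming a toy transaction table (the slide's own table did not survive extraction, so the five transactions below are hypothetical). The names `support` and `confidence` refer to the standard association-rule measures, which this lecture has not yet defined: support is the fraction of transactions containing an itemset, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. `frequent_pairs` is a brute-force helper written for this example, not an efficient mining algorithm.

```python
from itertools import combinations

# Hypothetical transactions; not taken from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction also containing consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

def frequent_pairs(transactions, min_support):
    """Brute force: test every pair of items against the support threshold."""
    items = sorted(set().union(*transactions))
    return [set(p) for p in combinations(items, 2)
            if support(p, transactions) >= min_support]

print(frequent_pairs(transactions, min_support=0.6))
print(confidence({"Milk"}, {"Coke"}, transactions))
print(confidence({"Diaper", "Milk"}, {"Beer"}, transactions))
```

On this toy data the two pairs that reach the 0.6 threshold are {Coke, Milk} and {Diaper, Milk}. Enumerating every pair is only workable for a handful of items; real frequent-itemset algorithms prune the search space.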

Frequent Itemsets: Applications
• Text mining: finding associated phrases in text
• There are lots of documents that contain the phrases
“association rules”, “data mining” and “efficient
algorithm”

• Recommendations:
• Users who buy this item often buy that item as well
• Users who watched James Bond movies also watched
Jason Bourne movies.

• Recommendations make use of item and user similarity

Association Rule Discovery:
Application
• Supermarket shelf management.
• Goal: To identify items that are bought together by
sufficiently many customers.
• Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies among
items.
• A classic rule --
• If a customer buys diapers and milk, then he is very likely to
buy beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers!


CLUSTERING
Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to
one another.
• Similarity Measures?
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
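
The definition above leaves the clustering procedure open. As one possible illustration, the sketch below uses Euclidean distance with k-means, a common algorithm that the slide does not name; the 2-D points, the starting centroids, and the function names are all assumptions chosen for this example.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

Points within each resulting cluster are close to one another, and the two clusters are far apart, which is the property the definition asks for.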


Illustrating Clustering
Euclidean distance-based clustering in 3-D space.

Intracluster distances are minimized; intercluster distances are maximized.


Clustering: Application 2
• Document Clustering:
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
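
The approach above can be sketched as follows. The slide does not say which similarity measure to build from the term frequencies; cosine similarity is used here as one common choice, and the three short documents are hypothetical. Tokenization is a plain whitespace split, which is far cruder than what a real system would use.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data",
    "d3": "the telescope scans the sky",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

print(cosine_similarity(tf["d1"], tf["d2"]))
print(cosine_similarity(tf["d1"], tf["d3"]))
```

Documents d1 and d2 share several terms and score well above d3, which shares none with d1. A clustering step would then group documents using these pairwise similarities.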


CLASSIFICATION
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function of
the values of the other attributes.

• Goal: previously unseen records should be
assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with the training set used to build
the model and the test set used to validate it.
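
A minimal sketch of the train/test procedure described above, assuming hypothetical credit records with two numeric attributes (income in thousands, years in current job) and a class label. The model here is a 1-nearest-neighbor classifier, chosen only because it fits in a few lines; the slide does not prescribe any particular model.

```python
import math

# Hypothetical labeled records: ((income_in_thousands, years_in_job), class).
records = [
    ((20, 1), "poor"), ((25, 2), "poor"), ((30, 1), "poor"),
    ((80, 10), "good"), ((90, 8), "good"), ((75, 12), "good"),
    ((22, 3), "poor"), ((85, 9), "good"),
]

# Hold-out split: the first six records build the model, the last two validate it.
train, test = records[:6], records[6:]

def predict(train, features):
    """1-nearest neighbor: return the class of the closest training record."""
    _, label = min(train, key=lambda rec: math.dist(rec[0], features))
    return label

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict(train, x) == y for x, y in test)
    return correct / len(test)

print(predict(train, (24, 2)))
print(accuracy(train, test))
```

The test records are never used to build the model, so the measured accuracy estimates how well unseen records are classified. With only two test records the estimate is of course very coarse.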
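Classification Example
[Figure: a table of records with two categorical attributes, one continuous attribute, and a class column. The training set is used to learn a classifier (the model), which is then applied to the test set.]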


Classification: Application 1
• Ad Click Prediction
• Goal: Predict whether a user who visits a web page will click
on a displayed ad. Use the prediction to target users with high
click probability.
• Approach:
• Collect data for users over a period of time and record who
clicks and who does not. The {click, no click} information
forms the class attribute.
• Use the history of the user (web pages browsed, queries
issued) as the features.
• Learn a classifier model and test on new users.
Classification: Application 2
• Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
• Use credit card transactions and the information on the account
holder as attributes.
• When the customer buys, what he buys, how often he pays on time,
etc.
• Label past transactions as fraud or fair transactions. This forms the
class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions
on an account.


ANY QUESTIONS?
