Datamining-Lect1 2

Uploaded by

ahmed.sherif3400

DATA MINING

LECTURE 1
Introduction

What is data mining?

Data mining is the use of efficient techniques for
the analysis of very large collections of data and the
extraction of useful and possibly unexpected
patterns in data.
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data

Why do we need data mining?

• Really huge amounts of raw data!
• In the digital age, terabytes of data are generated every second
• Mobile devices, digital photographs, web documents
• Facebook updates, tweets, blogs
• Transactions, sensor data, queries, clicks, browsing
• Cheap storage has made it possible to maintain this data
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!

• “Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets

• Need to analyze the raw data to extract knowledge

The data is also very complex

• Multiple types of data: tables, time series,
images, graphs, etc.

• Spatial and temporal aspects

• Spatial data provides information that identifies the location of
features and boundaries on Earth.

Example: transaction data

• Billions of real-life customers:
• Walmart: 20M transactions per day
• AT&T: 300M calls per day
• Credit card companies: billions of transactions per day

Example: document data

• Web as a document repository: an estimated 50 billion web pages
• Wikipedia: 4 million articles (and counting)
• Online news portals: a steady stream of hundreds of new articles every day
• Twitter: ~300 million tweets every day

Example: network data

• Web: 50 billion pages linked via hyperlinks
• Facebook: 500 million users
• Twitter: 300 million users
• Instant messenger: ~1 billion users
• Blogs: 250 million blogs worldwide; presidential candidates run blogs

Example: environmental data

• Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• “a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center”
• “6000 temperature stations, 7500 precipitation stations, 2000 pressure stations”
• Spatiotemporal data

Behavioral data
• Mobile phones today record a large amount of information about user behavior
• GPS records position
• Camera produces images
• Communication via phone and SMS
• Text via Facebook updates
• Amazon collects all the items that you browsed, placed into your basket, read reviews about, or purchased.
• Google records all your browsing activity via toolbar plugins. It also records the queries you asked, the pages you saw and the clicks you made.
• Data is collected for millions of users on a daily basis

Types of Attributes
• There are different types of attributes
• Categorical (qualitative)
• Examples: eye color, ID number, rankings (e.g., good, fair, bad), height in {tall, medium, short}
• Nominal (no order) vs. Ordinal (order)
• Nominal and Ordinal attributes are collectively referred to as categorical or qualitative attributes
• Numeric (quantitative)
• Examples: dates, temperature, time, length, value
• Interval (differences between values are meaningful, e.g., calendar dates)
• Ratio (ratios between values are also meaningful, e.g., length)

Types of Attributes

An independent way of distinguishing between attributes is by the number of values they can take:
• Discrete: a discrete attribute has a finite or countably infinite set of values.
• Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts.
• Discrete attributes are often represented using integer variables.
• Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1.
• Continuous: a continuous attribute is one whose values are real numbers.
• Examples include attributes such as temperature, height, or weight.
• Continuous attributes are typically represented as floating-point variables.

Record Data
• Much data mining work assumes that the data set is a
collection of records (data objects), each of which consists
of a fixed set of data fields (attributes).

Categorical Data
• Data that consists of a collection of records, each
of which consists of a fixed set of categorical
attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          High            No
2    No      Married         Medium          No
3    No      Single          Low             No
4    Yes     Married         High            No
5    No      Divorced        Medium          Yes
6    No      Married         Low             No
7    Yes     Divorced        High            No
8    No      Single          Medium          Yes
9    No      Married         Medium          No
10   No      Single          Medium          Yes

Document Data
• Each document becomes a "term" vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
• Bag-of-words representation – no ordering

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
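As a minimal sketch of the bag-of-words idea above, the following Python snippet turns a text into a term-count vector over a fixed vocabulary. The function name `term_vector`, the tokenization rule (lowercase runs of letters), the four-term vocabulary and the sample sentence are all assumptions made for illustration; they are not taken from the slides.

```python
import re
from collections import Counter


def term_vector(text, vocabulary):
    """Return how often each vocabulary term occurs in text, in vocabulary order."""
    # Assumed tokenization: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "game", "score"]
document = "The coach praised the team. The team won the game, game after game."

vector = term_vector(document, vocabulary)
print(vector)  # [2, 1, 3, 0]
```

Because only counts are kept, shuffling the words of the document yields the same vector, which is what "no ordering" means in the bag-of-words representation.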

Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

• A document can also be represented as a set of words (no counts)

Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data
• Time series
• Sequence of ordered (over “time”) numeric values.

Graph Data
• Examples: Web graph and HTML links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
So, what is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, feature, or dimension
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance
• Size: number of objects
• Dimensionality: number of attributes

In the table below, the columns are the attributes and the rows are the objects:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Mining tasks

Data mining tasks are generally divided into two major categories:
• Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.
• Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, and anomalies) that summarize the underlying relationships in data.

Data Mining tasks

Frequent Itemsets and Association Rules

• Given a set of records, each of which contains some
number of items from a given collection:
• Identify sets of items (itemsets) that frequently occur
together
• Produce dependency rules that predict the
occurrence of an item based on occurrences of other
items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Itemsets discovered:
{Milk, Coke}
{Diaper, Milk}

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining
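The itemsets and rules on this slide can be checked with a short Python sketch that counts how many transactions contain an itemset. The slide does not define "frequently" or name a rule-quality measure; the support and confidence measures and the 0.6 minimum-support threshold used here are assumptions for illustration, as are the function names.

```python
from itertools import combinations

# The five transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


# Assumed threshold: a pair is "frequent" if it occurs in at least 60% of transactions.
MIN_SUPPORT = 0.6
items = sorted(set().union(*transactions))
frequent_pairs = [p for p in combinations(items, 2) if support(p) >= MIN_SUPPORT]

print(frequent_pairs)                          # [('Coke', 'Milk'), ('Diaper', 'Milk')]
print(confidence({"Milk"}, {"Coke"}))          # 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))
```

With this threshold the two frequent pairs match the itemsets listed on the slide. The rule {Diaper, Milk} --> {Beer} holds in two of the three transactions that contain both Diaper and Milk.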

Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to
one another.


Clustering: Application
• Document Clustering:
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
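The approach above needs a similarity measure built from term frequencies. The slide leaves the measure open; cosine similarity is one common choice and is an assumption here. The sketch below applies it to the three term-count vectors from the Document Data slide.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term-count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Term-count vectors of Documents 1-3 from the Document Data slide.
doc1 = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]
doc2 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]
doc3 = [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]

docs = {"Document 1": doc1, "Document 2": doc2, "Document 3": doc3}
names = list(docs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(cosine_similarity(docs[a], docs[b]), 3))
```

A clustering algorithm would then group documents whose pairwise similarity is high. Among these three, Documents 1 and 3 are the most similar pair under this measure.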


Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for the class attribute as a function
of the values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with the training set used to build
the model and the test set used to validate it.
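The train/test procedure described above can be sketched in Python as follows. The ten records are the ones used in the lecture's tax-cheat example; everything else is an assumption for illustration: the 70/30 split, the fixed random seed, the helper names, and the deliberately simple model, which predicts the majority class seen in training for each value of one chosen attribute.

```python
import random
from collections import Counter, defaultdict

# (refund, marital_status, taxable_income_in_K, cheat); cheat is the class attribute.
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_one_attribute_rule(train, attribute_index):
    """Learn: for each value of one attribute, predict the majority class in train."""
    by_value = defaultdict(Counter)
    for record in train:
        by_value[record[attribute_index]][record[-1]] += 1
    # Fall back to the overall majority class for attribute values not seen in training.
    default = Counter(r[-1] for r in train).most_common(1)[0][0]
    rules = {value: c.most_common(1)[0][0] for value, c in by_value.items()}
    return lambda record: rules.get(record[attribute_index], default)


def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    return sum(model(r) == r[-1] for r in test) / len(test)


train, test = holdout_split(RECORDS)
model = train_one_attribute_rule(train, attribute_index=0)  # index 0 = Refund
print(len(train), len(test), accuracy(model, test))
```

The model never sees the test records during training, so the accuracy on the test set estimates how well it would classify previously unseen records. With only ten records the estimate is very noisy; the point of the sketch is the procedure, not the number.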

Classification Example
Training Set (used to learn the classifier model):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class to be assigned by the model):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set.

Classification: Application 1
• Ad Click Prediction
• Goal: Predict whether a user who visits a web page will click
on a displayed ad. Use the prediction to target users with a high
click probability.
• Approach:
• Collect data for users over a period of time and record who
clicks and who does not. The {click, no click} information
forms the class attribute.
• Use the history of the user (web pages browsed, queries
issued) as the features.
• Learn a classifier model and test on new users.

Connections of Data Mining with other areas
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous, distributed nature of data

[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, and Other Disciplines]

Knowledge Discovery (KDD) Process

• Data cleaning
• Data integration from multiple sources
• Data selection for data mining
• Data transformation
• Data mining
• Pattern evaluation
• Presentation of the mining results

The data analysis pipeline

• Mining is not the only step in the analysis process

Data --> Preprocessing --> Data Mining --> Post-processing --> Result

• Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning
is required to make sense of the data
• Techniques: sampling, dimensionality reduction, feature selection.
• It is dirty work, but it is often the
most important step of the analysis.
• Post-processing: make the data actionable and useful to the user
• Statistical analysis of importance
• Visualization
• Pre- and post-processing are often data mining tasks as well
• Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-consuming
step in the overall knowledge discovery process.
Post-processing
• Visualization
• The human eye is a powerful analytical tool
• If we visualize the data properly, we can discover
patterns
• Visualization is the way to present the data so that
patterns can be seen
• E.g., histograms and plots are a form of visualization
• There are multiple techniques (visualization is a field of its own)
Data Quality
• Examples of data quality problems:
• Noise and outliers
• Missing values
• Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat  Note
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    A mistake or a millionaire?
6    No      NULL            60K             No     Missing value
7    Yes     Divorced        220K            NULL   Missing value
8    No      Single          85K             Yes
9    No      Married         90K             No     Inconsistent duplicate entries
9    No      Single          90K             No     Inconsistent duplicate entries
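The three problems in this table can be flagged with a few lines of Python. This is only a sketch under stated assumptions: NULL is represented as `None`, duplicates are detected by a repeated Tid, and an income is called suspicious when it exceeds ten times the median. That factor of ten is an arbitrary illustrative threshold, not something the lecture prescribes.

```python
from collections import Counter
from statistics import median

# (tid, refund, marital_status, taxable_income_in_K, cheat); None stands for NULL.
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 10000, "Yes"),
    (6, "No", None, 60, "No"),
    (7, "Yes", "Divorced", 220, None),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 90, "No"),
    (9, "No", "Single", 90, "No"),
]

# Missing values: any record with a None field.
missing = [row[0] for row in ROWS if None in row]

# Duplicate data: the same Tid appears more than once.
tid_counts = Counter(row[0] for row in ROWS)
duplicates = [tid for tid, n in tid_counts.items() if n > 1]

# Possible outliers: income far above the median (assumed threshold: 10x).
median_income = median(row[3] for row in ROWS)
outliers = [row[0] for row in ROWS if row[3] > 10 * median_income]

print("missing values in Tid:", missing)    # [6, 7]
print("duplicated Tid:", duplicates)        # [9]
print("possible outliers in Tid:", outliers)  # [5]
```

Flagging is the easy part. Whether Tid 5 is a mistake or a millionaire, and which of the two Tid 9 entries is correct, cannot be decided from the data alone.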

Sampling
• Sampling is the main technique employed for data
selection.
• It is often used for both the preliminary investigation of the data and
the final data analysis.
• Statisticians sample because obtaining the entire set of
data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.

Sampling …
• The key principle for effective sampling is the
following:
• Using a sample will work almost as well as using the
entire data set, if the sample is representative
• A sample is representative if it has approximately the
same property (of interest) as the original set of data

Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• As each item is selected, it is removed from the population
• Sampling with replacement
• Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked more
than once
• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
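The sampling schemes listed above map directly onto Python's standard `random` module. In this sketch, `random.sample` draws without replacement and `random.choices` draws with replacement. The `stratified_sample` helper, the toy population of 100 integers, the partitioning by remainder modulo 4 and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(population, key, k_per_stratum, rng):
    """Partition the population by key(item), then sample each partition without replacement."""
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Never ask for more items than the partition contains.
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample


rng = random.Random(42)          # fixed seed so the sketch is repeatable
population = list(range(100))    # hypothetical population

without_replacement = rng.sample(population, 10)    # each item at most once
with_replacement = rng.choices(population, k=10)    # the same item may repeat
stratified = stratified_sample(population, lambda x: x % 4, 3, rng)

print(without_replacement)
print(with_replacement)
print(stratified)
```

Stratified sampling guarantees that every partition is represented, which simple random sampling does not when some partitions are small.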

Sample Size

[Figure: the same data set sampled at 8000 points, 2000 points and 500 points]
