
Data Preprocessing

● Aggregation
● Sampling
● Dimensionality Reduction
● Feature subset selection
● Feature creation
● Discretization and Binarization
● Attribute Transformation

01/27/2020 Introduction to Data Mining, 2nd Edition 68


Tan, Steinbach, Karpatne, Kumar
Aggregation

● Combining two or more attributes (or objects) into a single attribute (or object)

● Purpose
  – Data reduction
    ◆ Reduce the number of attributes or objects
  – Change of scale
    ◆ Cities aggregated into regions, states, countries, etc.
    ◆ Days aggregated into weeks, months, or years
  – More “stable” data
    ◆ Aggregated data tends to have less variability
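As a small sketch of these points (synthetic made-up values, not the Australia measurements discussed on the next slides), aggregating daily readings into monthly totals both reduces the number of objects and produces more stable data:

```python
import random
import statistics

random.seed(0)

# Hypothetical daily precipitation readings for three years
# (values are invented for illustration, not real measurements).
daily = [max(0.0, random.gauss(2.0, 1.5)) for _ in range(3 * 365)]

# Aggregation: sum each consecutive 30-day window into one "monthly" value.
# This is both a data reduction (fewer objects) and a change of scale.
monthly = [sum(daily[i:i + 30]) for i in range(0, len(daily) - 29, 30)]

# Aggregated data tends to have less variability: the coefficient of
# variation (stdev / mean) is much smaller for the monthly totals.
cv_daily = statistics.stdev(daily) / statistics.mean(daily)
cv_monthly = statistics.stdev(monthly) / statistics.mean(monthly)
print(len(daily), len(monthly), cv_daily > cv_monthly)
```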

Example: Precipitation in Australia

● This example is based on precipitation in Australia for the period 1982 to 1993. The next slide shows
  – A histogram of the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and
  – A histogram of the standard deviation of average yearly precipitation for the same locations.

● The average yearly precipitation has less variability than the average monthly precipitation.

● All precipitation measurements (and their standard deviations) are in centimeters.
Example: Precipitation in Australia …

(Figure: Variation of Precipitation in Australia. Left: histogram of the standard deviation of average monthly precipitation. Right: histogram of the standard deviation of average yearly precipitation.)
Sampling
● Sampling is the main technique employed for data reduction.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

● Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.

● Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Sampling …

● The key principle for effective sampling is the following:

  – Using a sample will work almost as well as using the entire data set, if the sample is representative.

  – A sample is representative if it has approximately the same properties (of interest) as the original set of data.

Sample Size

(Figure: the same data set sampled at 8000, 2000, and 500 points.)

Types of Sampling
● Simple Random Sampling
  – There is an equal probability of selecting any particular item
  – Sampling without replacement
    ◆ As each item is selected, it is removed from the population
  – Sampling with replacement
    ◆ Objects are not removed from the population as they are selected for the sample
    ◆ In sampling with replacement, the same object can be picked more than once

● Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition
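The three schemes above can be sketched in a few lines (a synthetic population of 100 items; stratifying by value modulo 4 is an arbitrary choice for illustration):

```python
import random

random.seed(1)
population = list(range(100))

# Simple random sampling without replacement: each selected item is
# removed from the population, so the sample has no duplicates.
without_repl = random.sample(population, 10)

# Sampling with replacement: objects stay in the population, so the
# same object can be picked more than once.
with_repl = random.choices(population, k=10)

# Stratified sampling: split the data into partitions (here by value
# modulo 4), then draw a random sample from each partition.
strata = {}
for x in population:
    strata.setdefault(x % 4, []).append(x)
stratified = [x for part in strata.values() for x in random.sample(part, 3)]

print(len(set(without_repl)), len(stratified))
```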

Sample Size
● What sample size is necessary to get at least one object from each of 10 equal-sized groups?
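This probability can be computed by inclusion-exclusion, assuming each draw is equally likely to come from any group (a with-replacement simplification). The sketch below finds the smallest sample size giving at least a 90% chance of covering all 10 groups; the 90% threshold is our choice, not the book's:

```python
from math import comb

def p_all_groups(n, g=10):
    """Probability that n equally likely draws include at least one
    object from each of g equal-sized groups (inclusion-exclusion)."""
    return sum((-1) ** k * comb(g, k) * ((g - k) / g) ** n
               for k in range(g + 1))

# Smallest sample size with >= 90% chance of covering all 10 groups.
n = 1
while p_all_groups(n) < 0.90:
    n += 1
print(n)
```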

Curse of Dimensionality

● When dimensionality increases, data becomes increasingly sparse in the space that it occupies

● Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

(Figure: randomly generate 500 points; compute the difference between the max and min distance between any pair of points.)
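The experiment in the figure can be reproduced in miniature (100 points instead of 500 to keep pure-Python pairwise distances cheap). As dimensionality grows, the gap between the maximum and minimum pairwise distances shrinks relative to the minimum:

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """(max pairwise distance - min pairwise distance) / min pairwise
    distance, for points drawn uniformly from the unit hypercube."""
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(42)
contrasts = [relative_contrast(100, d, rng) for d in (2, 10, 50)]

# The contrast collapses as dimensionality rises, so "near" and "far"
# points become hard to tell apart.
print([round(c, 2) for c in contrasts])
```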
Dimensionality Reduction

● Purpose:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise

● Techniques
  – Principal Components Analysis (PCA)
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques

Dimensionality Reduction: PCA

● The goal is to find a projection that captures the largest amount of variation in the data

(Figure: two-dimensional scatter plot with axes x1 and x2.)
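A minimal sketch of the idea on synthetic correlated 2-D data. The largest eigenvalue of the 2×2 covariance matrix is computed in closed form here rather than with a linear-algebra library; it equals the variance captured by the first principal component:

```python
import math
import random

random.seed(0)
# Correlated two-dimensional data: x2 follows x1, so most of the
# variation lies along a single direction.
x1 = [random.gauss(0, 1) for _ in range(1000)]
x2 = [0.8 * a + random.gauss(0, 0.3) for a in x1]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

# Largest eigenvalue of the 2x2 covariance matrix, via the closed form
# lambda = tr/2 + sqrt(tr^2/4 - det).
sxx, syy, sxy = cov(x1, x1), cov(x2, x2), cov(x1, x2)
tr, det = sxx + syy, sxx * syy - sxy ** 2
lam1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)

explained = lam1 / tr   # fraction of total variance along the 1st PC
print(round(explained, 3))
```

With this much correlation, a single projected attribute retains well over 90% of the total variance.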
Feature Subset Selection

● Another way to reduce the dimensionality of data

● Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: the purchase price of a product and the amount of sales tax paid

● Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: a student's ID is often irrelevant to the task of predicting the student's GPA

● Many techniques have been developed, especially for classification
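A simple correlation-based screen illustrates both cases on synthetic data echoing the slide's examples (real feature selection typically uses more robust criteria than plain Pearson correlation):

```python
import random

random.seed(3)

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db)

n = 500
price = [random.uniform(10, 100) for _ in range(n)]   # purchase price
tax = [0.08 * p for p in price]                       # sales tax paid
noise_id = [random.random() for _ in range(n)]        # ID-like attribute
target = [2.0 * p + random.gauss(0, 5) for p in price]

# Redundant feature: near-perfect correlation with another attribute.
r_redundant = pearson(price, tax)
# Irrelevant feature: near-zero correlation with the prediction target.
r_irrelevant = abs(pearson(noise_id, target))
print(round(r_redundant, 3), round(r_irrelevant, 3))
```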
Feature Creation

● Create new attributes that can capture the important information in a data set much more efficiently than the original attributes

● Three general methodologies:
  – Feature extraction
    ◆ Example: extracting edges from images
  – Feature construction
    ◆ Example: dividing mass by volume to get density
  – Mapping data to a new space
    ◆ Example: Fourier and wavelet analysis
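Two of these methodologies in a short sketch (the objects and signal frequencies are made up for illustration): feature construction derives density from mass and volume, and a discrete Fourier transform maps a time series into frequency space, where two sine components become two sharp peaks:

```python
import cmath
import math

# Feature construction: a derived attribute (density = mass / volume)
# can separate materials that mass and volume alone do not.
objects = [{"mass": 10.0, "volume": 3.7}, {"mass": 27.0, "volume": 10.0}]
for obj in objects:
    obj["density"] = obj["mass"] / obj["volume"]

# Mapping data to a new space: a naive O(N^2) discrete Fourier transform
# of a series made of two sine waves (frequencies 7 and 17 cycles).
N = 256
signal = [math.sin(2 * math.pi * 7 * t / N) +
          math.sin(2 * math.pi * 17 * t / N) for t in range(N)]
spectrum = [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / N)
                    for t in range(N)))
            for k in range(N // 2)]

# The two largest spectrum bins recover the underlying frequencies.
peaks = sorted(sorted(range(len(spectrum)), key=spectrum.__getitem__)[-2:])
print(peaks)   # -> [7, 17]
```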

Mapping Data to a New Space

● Fourier and wavelet transform

(Figure: a time series of two sine waves plus noise, shown in the time domain and in the frequency domain.)

Discretization

● Discretization is the process of converting a continuous attribute into an ordinal attribute
  – A potentially infinite number of values are mapped into a small number of categories
  – Discretization is commonly used in classification
  – Many classification algorithms work best if both the independent and dependent variables have only a few values
  – We give an illustration of the usefulness of discretization using the Iris data set

Iris Sample Data Set

● Iris Plant data set
  – Can be obtained from the UCI Machine Learning Repository
    http://www.ics.uci.edu/~mlearn/MLRepository.html
  – From the statistician Douglas Fisher
  – Three flower types (classes):
    ◆ Setosa
    ◆ Versicolour
    ◆ Virginica
  – Four (non-class) attributes
    ◆ Sepal width and length
    ◆ Petal width and length

(Photo: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.)
Discretization: Iris Example

Petal width low or petal length low implies Setosa.
Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …

● How can we tell what the best discretization is?

  – Unsupervised discretization: find breaks in the data values
    ◆ Example: Petal Length
    (Figure: histogram of petal length, counts vs. petal length from 0 to 8.)

  – Supervised discretization: use class labels to find breaks
Discretization Without Using Class Labels

Data consists of four groups of points and two outliers. Data is one-dimensional, but a random y component is added to reduce overlap.

Discretization Without Using Class Labels

Equal interval width approach used to obtain 4 values.

Discretization Without Using Class Labels

Equal frequency approach used to obtain 4 values.

Discretization Without Using Class Labels

K-means approach used to obtain 4 values.
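The three unsupervised approaches on the preceding slides can be sketched on synthetic one-dimensional data shaped like the figures (four groups plus two outliers; group centers and spreads are our choices):

```python
import random

random.seed(5)
# Four groups of 50 points each, centered at 2, 4, 6, 8, plus two outliers.
data = ([random.gauss(c, 0.3) for c in (2, 4, 6, 8) for _ in range(50)]
        + [0.0, 10.0])

def equal_width(xs, k):
    """Split [min, max] into k intervals of equal width."""
    lo, w = min(xs), (max(xs) - min(xs)) / k
    return [min(int((x - lo) / w), k - 1) for x in xs]

def equal_frequency(xs, k):
    """Assign roughly len(xs)/k points to each of k bins by rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [0] * len(xs)
    for rank, i in enumerate(order):
        labels[i] = min(rank * k // len(xs), k - 1)
    return labels

def kmeans_1d(xs, k, iters=25):
    """Plain 1-D k-means: repeatedly assign each point to its nearest
    center, then move each center to the mean of its members."""
    centers = sorted(random.sample(xs, k))
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c]))
                  for x in xs]
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

ew = equal_width(data, 4)
ef = equal_frequency(data, 4)
km = kmeans_1d(data, 4)
print(len(set(ew)), len(set(ef)), len(set(km)))
```

Note the contrast visible in the figures: equal width lets the two outliers stretch the intervals, equal frequency forces the outliers into the nearest populated bins, and k-means tends to track the actual groups.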

Binarization

● Binarization maps a continuous or categorical attribute into one or more binary variables

● Typically used for association analysis

● Often we convert a continuous attribute to a categorical attribute and then convert the categorical attribute to a set of binary attributes
  – Association analysis needs asymmetric binary attributes
  – Examples: eye color and height measured as {low, medium, high}
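A sketch of the two-step conversion described above, on a height attribute (the cut-offs at 160 and 180 are arbitrary choices for illustration):

```python
heights = [150, 163, 172, 181, 195]

# Step 1: discretize the continuous attribute into {low, medium, high}.
def to_category(h):
    return "low" if h < 160 else ("medium" if h < 180 else "high")

categories = [to_category(h) for h in heights]

# Step 2: convert the categorical attribute into a set of asymmetric
# binary attributes, one per category (a one-hot encoding).
values = ("low", "medium", "high")
binary = [{v: int(c == v) for v in values} for c in categories]

print(categories)
print(binary[0])   # -> {'low': 1, 'medium': 0, 'high': 0}
```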
Attribute Transformation

● An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
  – Simple functions: x^k, log(x), e^x, |x|
  – Normalization
    ◆ Refers to various techniques to adjust for differences among attributes in terms of frequency of occurrence, mean, variance, range
    ◆ Take out unwanted, common signal, e.g., seasonality
  – In statistics, standardization refers to subtracting off the mean and dividing by the standard deviation
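The simple functions and standardization above in a short sketch (the sample values are arbitrary):

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Simple functions: each old value maps to exactly one new value.
logged = [math.log(x) for x in values]
absolute = [abs(x) for x in values]

# Standardization: subtract the mean and divide by the standard
# deviation, giving mean 0 and (sample) standard deviation 1.
mu, sigma = statistics.mean(values), statistics.stdev(values)
z = [(x - mu) / sigma for x in values]

print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```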
Example: Sample Time Series of Plant Growth
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.

(Figure: monthly NPP time series for Minneapolis.)

Correlations between time series:

              Minneapolis   Atlanta   Sao Paolo
Minneapolis      1.0000      0.7591    -0.7581
Atlanta          0.7591      1.0000    -0.5739
Sao Paolo       -0.7581     -0.5739     1.0000
Seasonality Accounts for Much Correlation
Normalized using a monthly Z score: subtract off the monthly mean and divide by the monthly standard deviation.

(Figure: normalized monthly NPP time series for Minneapolis.)

Correlations between time series:

              Minneapolis   Atlanta   Sao Paolo
Minneapolis      1.0000      0.0492     0.0906
Atlanta          0.0492      1.0000    -0.0154
Sao Paolo        0.0906     -0.0154     1.0000
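The effect shown in the two tables can be reproduced on synthetic data (two made-up sites sharing a seasonal cycle plus independent noise; the numbers are illustrative, not the NPP data):

```python
import math
import random

random.seed(7)

def monthly_series(phase, years=12):
    """Synthetic monthly series: a strong seasonal cycle plus noise."""
    return [10 * math.sin(2 * math.pi * (m + phase) / 12) + random.gauss(0, 1)
            for m in range(years * 12)]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return num / (sum((x - ma) ** 2 for x in a) ** 0.5 *
                  sum((y - mb) ** 2 for y in b) ** 0.5)

def monthly_z(series):
    """Subtract each calendar month's mean and divide by that month's
    standard deviation, removing the shared seasonal signal."""
    out = [0.0] * len(series)
    for month in range(12):
        idx = range(month, len(series), 12)
        vals = [series[i] for i in idx]
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)) ** 0.5
        for i in idx:
            out[i] = (series[i] - mu) / sd
    return out

site_a, site_b = monthly_series(0.0), monthly_series(0.5)
raw_r = pearson(site_a, site_b)
norm_r = pearson(monthly_z(site_a), monthly_z(site_b))

# Seasonality drives the raw correlation; it largely vanishes after
# the monthly Z score normalization.
print(round(raw_r, 2), round(norm_r, 2))
```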
