
Data Preprocessing

● Aggregation
● Sampling
● Dimensionality Reduction
● Feature subset selection
● Feature creation
● Discretization and Binarization
● Attribute Transformation

01/27/2020 Introduction to Data Mining, 2nd Edition 68


Tan, Steinbach, Karpatne, Kumar
Aggregation

● Combining two or more attributes (or objects) into a single attribute (or object)

● Purpose
  – Data reduction
    ◆ Reduce the number of attributes or objects
  – Change of scale
    ◆ Cities aggregated into regions, states, countries, etc.
    ◆ Days aggregated into weeks, months, or years
  – More “stable” data
    ◆ Aggregated data tends to have less variability
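As a small sketch of these points (synthetic made-up values, not the Australia measurements discussed on the next slides), aggregating daily readings into monthly totals both reduces the number of objects and produces more stable data:

```python
import random
import statistics

random.seed(0)

# Hypothetical daily precipitation readings for three years
# (values are invented for illustration, not real measurements).
daily = [max(0.0, random.gauss(2.0, 1.5)) for _ in range(3 * 365)]

# Aggregation: sum each consecutive 30-day window into one "monthly" value.
# This is both a data reduction (fewer objects) and a change of scale.
monthly = [sum(daily[i:i + 30]) for i in range(0, len(daily) - 29, 30)]

# Aggregated data tends to have less variability: the coefficient of
# variation (stdev / mean) is much smaller for the monthly totals.
cv_daily = statistics.stdev(daily) / statistics.mean(daily)
cv_monthly = statistics.stdev(monthly) / statistics.mean(monthly)
print(len(daily), len(monthly), cv_daily > cv_monthly)
```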

Example: Precipitation in Australia

● This example is based on precipitation in Australia for the period 1982 to 1993. The next slide shows
  – A histogram of the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and
  – A histogram of the standard deviation of average yearly precipitation for the same locations.

● The average yearly precipitation has less variability than the average monthly precipitation.

● All precipitation measurements (and their standard deviations) are in centimeters.
Example: Precipitation in Australia …

(Figure: Variation of Precipitation in Australia. Left: histogram of the standard deviation of average monthly precipitation. Right: histogram of the standard deviation of average yearly precipitation.)
Sampling
● Sampling is the main technique employed for data reduction.
  – It is often used for both the preliminary investigation of the data and the final data analysis.

● Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.

● Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.

Sampling …

● The key principle for effective sampling is the following:

  – Using a sample will work almost as well as using the entire data set, if the sample is representative.

  – A sample is representative if it has approximately the same properties (of interest) as the original set of data.

Sample Size

(Figure: the same data set sampled at 8000, 2000, and 500 points.)

Types of Sampling
● Simple Random Sampling
  – There is an equal probability of selecting any particular item
  – Sampling without replacement
    ◆ As each item is selected, it is removed from the population
  – Sampling with replacement
    ◆ Objects are not removed from the population as they are selected for the sample
    ◆ In sampling with replacement, the same object can be picked more than once

● Stratified sampling
  – Split the data into several partitions; then draw random samples from each partition
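The three schemes above can be sketched in a few lines (a synthetic population of 100 items; stratifying by value modulo 4 is an arbitrary choice for illustration):

```python
import random

random.seed(1)
population = list(range(100))

# Simple random sampling without replacement: each selected item is
# removed from the population, so the sample has no duplicates.
without_repl = random.sample(population, 10)

# Sampling with replacement: objects stay in the population, so the
# same object can be picked more than once.
with_repl = random.choices(population, k=10)

# Stratified sampling: split the data into partitions (here by value
# modulo 4), then draw a random sample from each partition.
strata = {}
for x in population:
    strata.setdefault(x % 4, []).append(x)
stratified = [x for part in strata.values() for x in random.sample(part, 3)]

print(len(set(without_repl)), len(stratified))
```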

Sample Size
● What sample size is necessary to get at least one object from each of 10 equal-sized groups?
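This probability can be computed by inclusion-exclusion, assuming each draw is equally likely to come from any group (a with-replacement simplification). The sketch below finds the smallest sample size giving at least a 90% chance of covering all 10 groups; the 90% threshold is our choice, not the book's:

```python
from math import comb

def p_all_groups(n, g=10):
    """Probability that n equally likely draws include at least one
    object from each of g equal-sized groups (inclusion-exclusion)."""
    return sum((-1) ** k * comb(g, k) * ((g - k) / g) ** n
               for k in range(g + 1))

# Smallest sample size with >= 90% chance of covering all 10 groups.
n = 1
while p_all_groups(n) < 0.90:
    n += 1
print(n)
```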

Curse of Dimensionality

● When dimensionality increases, data becomes increasingly sparse in the space that it occupies

● Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

(Figure: randomly generate 500 points; compute the difference between the max and min distance between any pair of points.)
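The experiment in the figure can be reproduced in miniature (100 points instead of 500 to keep pure-Python pairwise distances cheap). As dimensionality grows, the gap between the maximum and minimum pairwise distances shrinks relative to the minimum:

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """(max pairwise distance - min pairwise distance) / min pairwise
    distance, for points drawn uniformly from the unit hypercube."""
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(42)
contrasts = [relative_contrast(100, d, rng) for d in (2, 10, 50)]

# The contrast collapses as dimensionality rises, so "near" and "far"
# points become hard to tell apart.
print([round(c, 2) for c in contrasts])
```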
Dimensionality Reduction

● Purpose:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
  – May help to eliminate irrelevant features or reduce noise

● Techniques
  – Principal Components Analysis (PCA)
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques

Dimensionality Reduction: PCA

● The goal is to find a projection that captures the largest amount of variation in the data

(Figure: two-dimensional scatter plot with axes x1 and x2.)
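A minimal sketch of the idea on synthetic correlated 2-D data. The largest eigenvalue of the 2×2 covariance matrix is computed in closed form here rather than with a linear-algebra library; it equals the variance captured by the first principal component:

```python
import math
import random

random.seed(0)
# Correlated two-dimensional data: x2 follows x1, so most of the
# variation lies along a single direction.
x1 = [random.gauss(0, 1) for _ in range(1000)]
x2 = [0.8 * a + random.gauss(0, 0.3) for a in x1]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

# Largest eigenvalue of the 2x2 covariance matrix, via the closed form
# lambda = tr/2 + sqrt(tr^2/4 - det).
sxx, syy, sxy = cov(x1, x1), cov(x2, x2), cov(x1, x2)
tr, det = sxx + syy, sxx * syy - sxy ** 2
lam1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)

explained = lam1 / tr   # fraction of total variance along the 1st PC
print(round(explained, 3))
```

With this much correlation, a single projected attribute retains well over 90% of the total variance.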
Feature Subset Selection

● Another way to reduce the dimensionality of data

● Redundant features
  – Duplicate much or all of the information contained in one or more other attributes
  – Example: the purchase price of a product and the amount of sales tax paid

● Irrelevant features
  – Contain no information that is useful for the data mining task at hand
  – Example: a student's ID is often irrelevant to the task of predicting the student's GPA

● Many techniques have been developed, especially for classification
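A simple correlation-based screen illustrates both cases on synthetic data echoing the slide's examples (real feature selection typically uses more robust criteria than plain Pearson correlation):

```python
import random

random.seed(3)

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db)

n = 500
price = [random.uniform(10, 100) for _ in range(n)]   # purchase price
tax = [0.08 * p for p in price]                       # sales tax paid
noise_id = [random.random() for _ in range(n)]        # ID-like attribute
target = [2.0 * p + random.gauss(0, 5) for p in price]

# Redundant feature: near-perfect correlation with another attribute.
r_redundant = pearson(price, tax)
# Irrelevant feature: near-zero correlation with the prediction target.
r_irrelevant = abs(pearson(noise_id, target))
print(round(r_redundant, 3), round(r_irrelevant, 3))
```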
Feature Creation

● Create new attributes that can capture the important information in a data set much more efficiently than the original attributes

● Three general methodologies:
  – Feature extraction
    ◆ Example: extracting edges from images
  – Feature construction
    ◆ Example: dividing mass by volume to get density
  – Mapping data to a new space
    ◆ Example: Fourier and wavelet analysis
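Two of these methodologies in a short sketch (the objects and signal frequencies are made up for illustration): feature construction derives density from mass and volume, and a discrete Fourier transform maps a time series into frequency space, where two sine components become two sharp peaks:

```python
import cmath
import math

# Feature construction: a derived attribute (density = mass / volume)
# can separate materials that mass and volume alone do not.
objects = [{"mass": 10.0, "volume": 3.7}, {"mass": 27.0, "volume": 10.0}]
for obj in objects:
    obj["density"] = obj["mass"] / obj["volume"]

# Mapping data to a new space: a naive O(N^2) discrete Fourier transform
# of a series made of two sine waves (frequencies 7 and 17 cycles).
N = 256
signal = [math.sin(2 * math.pi * 7 * t / N) +
          math.sin(2 * math.pi * 17 * t / N) for t in range(N)]
spectrum = [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / N)
                    for t in range(N)))
            for k in range(N // 2)]

# The two largest spectrum bins recover the underlying frequencies.
peaks = sorted(sorted(range(len(spectrum)), key=spectrum.__getitem__)[-2:])
print(peaks)   # -> [7, 17]
```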

Mapping Data to a New Space

● Fourier and wavelet transform

(Figure: a time series of two sine waves plus noise, shown in the time domain and in the frequency domain.)

Discretization

● Discretization is the process of converting a continuous attribute into an ordinal attribute
  – A potentially infinite number of values are mapped into a small number of categories
  – Discretization is commonly used in classification
  – Many classification algorithms work best if both the independent and dependent variables have only a few values
  – We give an illustration of the usefulness of discretization using the Iris data set

Iris Sample Data Set

● Iris Plant data set
  – Can be obtained from the UCI Machine Learning Repository
    http://www.ics.uci.edu/~mlearn/MLRepository.html
  – From the statistician Douglas Fisher
  – Three flower types (classes):
    ◆ Setosa
    ◆ Versicolour
    ◆ Virginica
  – Four (non-class) attributes
    ◆ Sepal width and length
    ◆ Petal width and length

(Photo: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.)
Discretization: Iris Example

Petal width low or petal length low implies Setosa.
Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …

● How can we tell what the best discretization is?

  – Unsupervised discretization: find breaks in the data values
    ◆ Example: Petal Length
    (Figure: histogram of petal length, counts vs. petal length from 0 to 8.)

  – Supervised discretization: use class labels to find breaks
Discretization Without Using Class Labels

Data consists of four groups of points and two outliers. Data is one-dimensional, but a random y component is added to reduce overlap.

Discretization Without Using Class Labels

Equal interval width approach used to obtain 4 values.

Discretization Without Using Class Labels

Equal frequency approach used to obtain 4 values.

Discretization Without Using Class Labels

K-means approach used to obtain 4 values.
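The three unsupervised approaches on the preceding slides can be sketched on synthetic one-dimensional data shaped like the figures (four groups plus two outliers; group centers and spreads are our choices):

```python
import random

random.seed(5)
# Four groups of 50 points each, centered at 2, 4, 6, 8, plus two outliers.
data = ([random.gauss(c, 0.3) for c in (2, 4, 6, 8) for _ in range(50)]
        + [0.0, 10.0])

def equal_width(xs, k):
    """Split [min, max] into k intervals of equal width."""
    lo, w = min(xs), (max(xs) - min(xs)) / k
    return [min(int((x - lo) / w), k - 1) for x in xs]

def equal_frequency(xs, k):
    """Assign roughly len(xs)/k points to each of k bins by rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [0] * len(xs)
    for rank, i in enumerate(order):
        labels[i] = min(rank * k // len(xs), k - 1)
    return labels

def kmeans_1d(xs, k, iters=25):
    """Plain 1-D k-means: repeatedly assign each point to its nearest
    center, then move each center to the mean of its members."""
    centers = sorted(random.sample(xs, k))
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c]))
                  for x in xs]
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

ew = equal_width(data, 4)
ef = equal_frequency(data, 4)
km = kmeans_1d(data, 4)
print(len(set(ew)), len(set(ef)), len(set(km)))
```

Note the contrast visible in the figures: equal width lets the two outliers stretch the intervals, equal frequency forces the outliers into the nearest populated bins, and k-means tends to track the actual groups.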

Binarization

● Binarization maps a continuous or categorical attribute into one or more binary variables

● Typically used for association analysis

● Often we convert a continuous attribute to a categorical attribute and then convert the categorical attribute to a set of binary attributes
  – Association analysis needs asymmetric binary attributes
  – Examples: eye color and height measured as {low, medium, high}
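A sketch of the two-step conversion described above, on a height attribute (the cut-offs at 160 and 180 are arbitrary choices for illustration):

```python
heights = [150, 163, 172, 181, 195]

# Step 1: discretize the continuous attribute into {low, medium, high}.
def to_category(h):
    return "low" if h < 160 else ("medium" if h < 180 else "high")

categories = [to_category(h) for h in heights]

# Step 2: convert the categorical attribute into a set of asymmetric
# binary attributes, one per category (a one-hot encoding).
values = ("low", "medium", "high")
binary = [{v: int(c == v) for v in values} for c in categories]

print(categories)
print(binary[0])   # -> {'low': 1, 'medium': 0, 'high': 0}
```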
Attribute Transformation

● An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
  – Simple functions: x^k, log(x), e^x, |x|
  – Normalization
    ◆ Refers to various techniques to adjust for differences among attributes in terms of frequency of occurrence, mean, variance, range
    ◆ Take out unwanted, common signal, e.g., seasonality
  – In statistics, standardization refers to subtracting off the mean and dividing by the standard deviation
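The simple functions and standardization above in a short sketch (the sample values are arbitrary):

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Simple functions: each old value maps to exactly one new value.
logged = [math.log(x) for x in values]
absolute = [abs(x) for x in values]

# Standardization: subtract the mean and divide by the standard
# deviation, giving mean 0 and (sample) standard deviation 1.
mu, sigma = statistics.mean(values), statistics.stdev(values)
z = [(x - mu) / sigma for x in values]

print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```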
Example: Sample Time Series of Plant Growth
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.

(Figure: monthly NPP time series for Minneapolis.)

Correlations between time series:

              Minneapolis   Atlanta   Sao Paolo
Minneapolis      1.0000      0.7591    -0.7581
Atlanta          0.7591      1.0000    -0.5739
Sao Paolo       -0.7581     -0.5739     1.0000
Seasonality Accounts for Much Correlation
Normalized using a monthly Z score: subtract off the monthly mean and divide by the monthly standard deviation.

(Figure: normalized monthly NPP time series for Minneapolis.)

Correlations between time series:

              Minneapolis   Atlanta   Sao Paolo
Minneapolis      1.0000      0.0492     0.0906
Atlanta          0.0492      1.0000    -0.0154
Sao Paolo        0.0906     -0.0154     1.0000
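The effect shown in the two tables can be reproduced on synthetic data (two made-up sites sharing a seasonal cycle plus independent noise; the numbers are illustrative, not the NPP data):

```python
import math
import random

random.seed(7)

def monthly_series(phase, years=12):
    """Synthetic monthly series: a strong seasonal cycle plus noise."""
    return [10 * math.sin(2 * math.pi * (m + phase) / 12) + random.gauss(0, 1)
            for m in range(years * 12)]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return num / (sum((x - ma) ** 2 for x in a) ** 0.5 *
                  sum((y - mb) ** 2 for y in b) ** 0.5)

def monthly_z(series):
    """Subtract each calendar month's mean and divide by that month's
    standard deviation, removing the shared seasonal signal."""
    out = [0.0] * len(series)
    for month in range(12):
        idx = range(month, len(series), 12)
        vals = [series[i] for i in idx]
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)) ** 0.5
        for i in idx:
            out[i] = (series[i] - mu) / sd
    return out

site_a, site_b = monthly_series(0.0), monthly_series(0.5)
raw_r = pearson(site_a, site_b)
norm_r = pearson(monthly_z(site_a), monthly_z(site_b))

# Seasonality drives the raw correlation; it largely vanishes after
# the monthly Z score normalization.
print(round(raw_r, 2), round(norm_r, 2))
```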
