CSC 522 Lecture 3

The lecture covers data preparation techniques essential for automated learning and data analysis, including aggregation, sampling, feature subset selection, dimensionality reduction, feature creation, and discretization. It emphasizes the importance of representative sampling and various sampling methods, as well as the selection of relevant features to improve model performance. Additionally, it discusses discretization and binarization methods for handling continuous and categorical data.


CSC522 - Automated Learning and Data Analysis

Lecture 3: Data Preparation

Dr. Pankaj R. Telang

North Carolina State University

August 29, 2024



Agenda

Aggregation
Sampling
Feature subset selection
Dimensionality Reduction
Feature creation
Discretization



Data Mining



Question

Suppose you have a very large dataset. It is expensive and time-consuming to process the dataset. What can you do?



Aggregation

Combining two or more objects into a single object


Aggregation reduces the number of attributes or objects
Aggregated data is more stable because it has less variability
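As a minimal sketch of aggregation, the snippet below collapses individual transactions into one object per (store, month) with pandas; the column names and values are hypothetical, not from the lecture.

```python
import pandas as pd

# Hypothetical daily transactions for two stores
daily = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "month": ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
    "sales": [100, 150, 120, 90, 200, 110],
})

# Aggregate: many transaction objects become one object per (store, month)
monthly = daily.groupby(["store", "month"], as_index=False)["sales"].sum()
print(monthly)
```

The aggregated monthly totals vary less from row to row than the raw daily transactions, which is the stability point made above.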




Sampling

Processing a large data set is too expensive or time-consuming

Sampling is employed to reduce the amount of data
For effective sampling, the sample should be representative of the entire data set
▶ A representative sample has approximately the same properties as the entire data set
▶ A sample must be large enough to represent the patterns in the population




Types of Sampling

Simple random sampling (see the sketch after this list)
▶ Equal probability of selecting any particular object
▶ Sampling without replacement: selected objects are removed from the population
▶ Sampling with replacement: selected objects are not removed, so the same object can be picked more than once

Stratified sampling
▶ Sample from each group
▶ Ensures all groups are represented
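A minimal sketch of simple random sampling with and without replacement, assuming the data objects can be represented as a numpy array; the population and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(1000)  # hypothetical population of 1000 objects

# Sampling without replacement: each object can appear at most once
sample_wo = rng.choice(data, size=100, replace=False)

# Sampling with replacement: the same object can be picked more than once
sample_w = rng.choice(data, size=100, replace=True)

print(len(set(sample_wo)), len(set(sample_w)))  # 100, usually fewer than 100
```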



Stratified Sampling Variations

Equal number of objects from each group

Number of objects proportional to group size
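Both variations can be sketched with pandas, assuming a DataFrame with a group attribute; the group names and sizes below are invented.

```python
import pandas as pd

# Hypothetical data set with an imbalanced group attribute
df = pd.DataFrame({"group": ["A"] * 900 + ["B"] * 100,
                   "value": range(1000)})

# Variation 1: equal number of objects from each group
equal = df.groupby("group").sample(n=50, random_state=0)

# Variation 2: number of objects proportional to group size (10% of each group)
proportional = df.groupby("group").sample(frac=0.1, random_state=0)

print(equal["group"].value_counts())         # A: 50, B: 50
print(proportional["group"].value_counts())  # A: 90, B: 10
```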



Types of Sampling

Progressive sampling
▶ Start with a small sample
▶ Progressively increase the sample size until it is sufficient
▶ Eliminates the need to determine the correct sample size in advance
▶ Requires a way to evaluate the sample and judge whether it is large enough
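A rough sketch of progressive sampling, assuming a scikit-learn style classifier and a held-out validation set; the doubling schedule and the stopping rule (accuracy improves by less than 0.01) are illustrative choices, not from the lecture.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, random_state=0)
X_pool, X_val, y_pool, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

size, prev_score = 200, -np.inf
while size <= len(X_pool):
    model = LogisticRegression(max_iter=1000).fit(X_pool[:size], y_pool[:size])
    score = model.score(X_val, y_val)
    print(f"n={size:6d} accuracy={score:.3f}")
    if score - prev_score < 0.01:   # sample is judged "large enough"
        break
    prev_score, size = score, size * 2
```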




Features

Features (X) are independent variables
▶ Also called attributes or variables
▶ Often computed by transforming an attribute
Target (y) is the dependent variable
▶ This is what we want to predict
A model f yields the target y given the features X

y = f(X)





Feature Subset Selection

When there are too many features, we may need to select a subset of them
▶ The intent is to make learning more efficient and to improve the learning algorithm's performance
Redundant feature: duplicates information contained in one or more other attributes
▶ Example: the purchase price of a product and the amount of sales tax paid
Irrelevant feature: contains no information that is useful for the data mining task at hand
▶ Example: Student ID is irrelevant in predicting GPA



Using Correlation For Feature Selection
Example

Which feature is better for predicting y?
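To make the example concrete, here is a small sketch that ranks two invented candidate features by the magnitude of their Pearson correlation with y; the data-generating relationship (y driven mostly by x1) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)   # y depends mostly on x1

# Pearson correlation of each candidate feature with the target
for name, x in [("x1", x1), ("x2", x2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"corr({name}, y) = {r:+.2f}")
# The feature with the larger |correlation| is the better linear predictor of y
```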




Using Correlation For Feature Selection
Warning

Correlation measures the strength of a linear relationship

Zero correlation does not mean that features are unrelated; they may be related non-linearly (see the sketch below)
Features may be useful only in combination with other features

[Figure: scatter plots of data sets with various correlation values; credit: “DenisBoigelot,” Wikimedia Commons]
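The warning about non-linear relationships can be seen numerically in a short sketch, assuming a noiseless quadratic relationship y = x² over a symmetric range (an invented example).

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y = x ** 2                      # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")   # approximately 0
```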





Feature Subset Selection Techniques

Embedded: Feature selection occurs naturally as part of the data mining algorithm
▶ Example: decision trees
Filter: Features are selected before the data mining algorithm is run
▶ Example: select attributes whose pairwise correlation is low
Wrapper: Use the data mining algorithm as a black box to find the best subset of attributes
▶ Example: run the data mining algorithm with various subsets of attributes, and select the subset that gives the best performance (a sketch follows below)
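A minimal sketch of the wrapper approach, assuming a scikit-learn classifier and cross-validated accuracy as the score; exhaustively trying every non-empty subset is only feasible for a handful of features, so this is illustrative rather than a recommended search strategy.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        cols = list(subset)
        # The learning algorithm is a black box: train and score it on this subset
        score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(f"Best subset of feature indices: {best_subset}, CV accuracy: {best_score:.3f}")
```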




Discretization

Discretization is the process of converting a continuous attribute into an ordinal attribute
Example: [1, 3, 5, 8, 10, 12] → [Low, Low, Mid, Mid, High, High]
Useful when a model cannot use continuous features
Two types
▶ Unsupervised discretization: class information is not used
⋆ E.g., equal interval, equal frequency, K-means clustering
▶ Supervised discretization: class information is used
⋆ Place the splits to maximize the “purity”



Unsupervised Discretization
Equal frequency vs interval

Data:
1 2 4 4 6 25 30 80 100

3 equal frequency bins (three values per bin):
| 1 2 4 | 4 6 25 | 30 80 100 |

3 equal interval bins of width (100 - 1)/3 = 33:
| 1 2 4 4 6 25 30 | (empty) | 80 100 |
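A small sketch of both unsupervised schemes on the same data using pandas: `qcut` does equal-frequency binning and `cut` does equal-interval binning; the Low/Mid/High labels are reused from the earlier discretization example.

```python
import pandas as pd

values = pd.Series([1, 2, 4, 4, 6, 25, 30, 80, 100])

# Equal frequency: each bin gets roughly the same number of values
# (ties such as the two 4s can make qcut's counts differ slightly from a manual 3/3/3 split)
equal_freq = pd.qcut(values, q=3, labels=["Low", "Mid", "High"])

# Equal interval: each bin spans the same range of values, (100 - 1)/3 = 33
equal_width = pd.cut(values, bins=3, labels=["Low", "Mid", "High"])

print(pd.DataFrame({"value": values,
                    "equal_freq": equal_freq,
                    "equal_width": equal_width}))
```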





Supervised Discretization

Class labels are available

The idea is to place the splits to maximize the “purity”
Purity: a measure of how homogeneous the class labels are in a bin or group
Example: nine objects sorted by attribute value, each labeled T or F (4 T and 5 F in total); two candidate split points are compared on the next slide




Supervised Discretization

Split 1
Left bin has 2T and 4F, i.e. 4/6 ≈ 67% F
Right bin has 2T and 1F, i.e. 2/3 ≈ 67% T

Split 2
Left bin has 2T, i.e. 100% T
Right bin has 2T and 5F, i.e. 5/7 ≈ 71% F

Split 2 produces purer bins (100% T and 71% F) than Split 1 (67% F and 67% T), so a purity-maximizing discretizer prefers Split 2.
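A small sketch of how one might score the two candidate splits, with bins built directly from the counts on the slide; the purity measure used here (majority-class fraction) is one simple choice, and entropy or Gini would work equally well.

```python
from collections import Counter

def purity(labels):
    """Fraction of a bin occupied by its majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Bins implied by the counts on the slide (order within a bin does not matter)
split1 = [["T", "T", "F", "F", "F", "F"], ["T", "T", "F"]]
split2 = [["T", "T"], ["T", "T", "F", "F", "F", "F", "F"]]

for name, bins in [("Split 1", split1), ("Split 2", split2)]:
    print(name, [f"{purity(b):.0%}" for b in bins])
# Split 2 has the purer bins, so a purity-maximizing discretizer prefers it
```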



Binarization

Binarization maps a continuous or categorical attribute into one or more binary variables

Continuous attribute example
▶ Data:
1 2 4 4 6 25 30 80 100
▶ 2 equal frequency bins (split at the median):
| 1 2 4 4 6 | 25 30 80 100 |
▶ 2 equal interval bins of width (100 - 1)/2 = 49.5:
| 1 2 4 4 6 25 30 | 80 100 |




Binarization

Categorical attribute example

Encoding the integer value with 3 binary variables (the binary representation of the integer value):

Categorical Value  Integer Value  x1  x2  x3
awful              0              0   0   0
poor               1              0   0   1
OK                 2              0   1   0
good               3              0   1   1
great              4              1   0   0

Encoding with one asymmetric binary variable per value (one-hot):

Categorical Value  Integer Value  x1  x2  x3  x4  x5
awful              0              1   0   0   0   0
poor               1              0   1   0   0   0
OK                 2              0   0   1   0   0
good               3              0   0   0   1   0
great              4              0   0   0   0   1
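A short sketch of both encodings with pandas/numpy: `get_dummies` produces the one-hot (asymmetric binary) encoding, and the three-bit version is derived from the integer codes; the column names are invented for illustration.

```python
import pandas as pd

levels = ["awful", "poor", "OK", "good", "great"]
ratings = pd.Series(levels, dtype=pd.CategoricalDtype(levels, ordered=True))

# One asymmetric binary variable per category (one-hot encoding, 5 variables)
one_hot = pd.get_dummies(ratings, prefix="x")

# Three binary variables holding the binary representation of the integer code
codes = ratings.cat.codes.to_numpy()            # awful=0, poor=1, ..., great=4
bits = pd.DataFrame({f"x{i + 1}": (codes >> (2 - i)) & 1 for i in range(3)})

print(one_hot)
print(bits)
```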

