CSC 522 Lecture 3

The lecture covers data preparation techniques essential for automated learning and data analysis, including aggregation, sampling, feature subset selection, dimensionality reduction, feature creation, and discretization. It emphasizes the importance of representative sampling and various sampling methods, as well as the selection of relevant features to improve model performance. Additionally, it discusses discretization and binarization methods for handling continuous and categorical data.


CSC522 - Automated Learning and Data Analysis

Lecture 3: Data Preparation

Dr. Pankaj R. Telang

North Carolina State University

August 29, 2024



Agenda

Aggregation
Sampling
Feature subset selection
Dimensionality Reduction
Feature creation
Discretization



Data Mining



Question

Suppose you have a very large dataset. It is expensive and time-consuming to process the dataset. What can you do?



Aggregation

Combining two or more objects into a single object


Aggregation reduces the number of attributes or objects
Aggregated data is more stable because it has less variability
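As a minimal sketch of aggregation, the snippet below collapses individual transactions into one object per (store, month) with pandas; the column names and values are hypothetical, not from the lecture.

```python
import pandas as pd

# Hypothetical daily transactions for two stores
daily = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "month": ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
    "sales": [100, 150, 120, 90, 200, 110],
})

# Aggregate: many transaction objects become one object per (store, month)
monthly = daily.groupby(["store", "month"], as_index=False)["sales"].sum()
print(monthly)
```

The aggregated monthly totals vary less from row to row than the raw daily transactions, which is the stability point made above.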




Sampling

Processing a large data set is too expensive or time-consuming

Sampling is employed to reduce the amount of data
For effective sampling, the sample should be representative of the entire data set
▶ A representative sample has approximately the same properties as the entire data set
▶ A sample must be large enough to represent the patterns in the population




Types of Sampling

Simple random sampling (see the sketch after this list)
▶ Equal probability of selecting any particular object
▶ Sampling without replacement: selected objects are removed from the population
▶ Sampling with replacement: selected objects are not removed, so the same object can be picked more than once

Stratified sampling
▶ Sample from each group
▶ Ensures all groups are represented
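A minimal sketch of simple random sampling with and without replacement, assuming the data objects can be represented as a numpy array; the population and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(1000)  # hypothetical population of 1000 objects

# Sampling without replacement: each object can appear at most once
sample_wo = rng.choice(data, size=100, replace=False)

# Sampling with replacement: the same object can be picked more than once
sample_w = rng.choice(data, size=100, replace=True)

print(len(set(sample_wo)), len(set(sample_w)))  # 100, usually fewer than 100
```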



Stratified Sampling Variations

Equal number of objects from each group

Number of objects proportional to group size
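Both variations can be sketched with pandas, assuming a DataFrame with a group attribute; the group names and sizes below are invented.

```python
import pandas as pd

# Hypothetical data set with an imbalanced group attribute
df = pd.DataFrame({"group": ["A"] * 900 + ["B"] * 100,
                   "value": range(1000)})

# Variation 1: equal number of objects from each group
equal = df.groupby("group").sample(n=50, random_state=0)

# Variation 2: number of objects proportional to group size (10% of each group)
proportional = df.groupby("group").sample(frac=0.1, random_state=0)

print(equal["group"].value_counts())         # A: 50, B: 50
print(proportional["group"].value_counts())  # A: 90, B: 10
```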



Types of Sampling

Progressive sampling
▶ Start with a small sample
▶ Progressively increase the sample size until it is sufficient
▶ Eliminates the need to determine the correct sample size in advance
▶ Requires a way to evaluate the sample and judge whether it is large enough
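A rough sketch of progressive sampling, assuming a scikit-learn style classifier and a held-out validation set; the doubling schedule and the stopping rule (accuracy improves by less than 0.01) are illustrative choices, not from the lecture.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, random_state=0)
X_pool, X_val, y_pool, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

size, prev_score = 200, -np.inf
while size <= len(X_pool):
    model = LogisticRegression(max_iter=1000).fit(X_pool[:size], y_pool[:size])
    score = model.score(X_val, y_val)
    print(f"n={size:6d} accuracy={score:.3f}")
    if score - prev_score < 0.01:   # sample is judged "large enough"
        break
    prev_score, size = score, size * 2
```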




Features

Features (X) are independent variables
▶ Also called attributes or variables
▶ Often computed by transforming an attribute
Target (y) is the dependent variable
▶ This is what we want to predict
A model f yields the target y given the features X

y = f(X)





Feature Subset Selection

When there are too many features, we may need to select a subset of them
▶ The intent is to make learning more efficient and to improve the learning algorithm's performance
Redundant feature: duplicates information contained in one or more other attributes
▶ Example: the purchase price of a product and the amount of sales tax paid
Irrelevant feature: contains no information that is useful for the data mining task at hand
▶ Example: Student ID is irrelevant in predicting GPA



Using Correlation For Feature Selection
Example

Which feature is better for predicting y?
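To make the example concrete, here is a small sketch that ranks two invented candidate features by the magnitude of their Pearson correlation with y; the data-generating relationship (y driven mostly by x1) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)   # y depends mostly on x1

# Pearson correlation of each candidate feature with the target
for name, x in [("x1", x1), ("x2", x2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"corr({name}, y) = {r:+.2f}")
# The feature with the larger |correlation| is the better linear predictor of y
```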




Using Correlation For Feature Selection
Warning

Correlation measures the strength of a linear relationship

Zero correlation does not mean that features are unrelated; they may be related non-linearly (see the sketch below)
Features may be useful only in combination with other features

[Figure: scatter plots of data sets with various correlation values; credit: “DenisBoigelot,” Wikimedia Commons]
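The warning about non-linear relationships can be seen numerically in a short sketch, assuming a noiseless quadratic relationship y = x² over a symmetric range (an invented example).

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y = x ** 2                      # y is completely determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")   # approximately 0
```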





Feature Subset Selection Techniques

Embedded: Feature selection occurs naturally as part of the data mining algorithm
▶ Example: decision trees
Filter: Features are selected before the data mining algorithm is run
▶ Example: select attributes whose pairwise correlation is low
Wrapper: Use the data mining algorithm as a black box to find the best subset of attributes
▶ Example: run the data mining algorithm with various subsets of attributes, and select the subset that gives the best performance (a sketch follows below)
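A minimal sketch of the wrapper approach, assuming a scikit-learn classifier and cross-validated accuracy as the score; exhaustively trying every non-empty subset is only feasible for a handful of features, so this is illustrative rather than a recommended search strategy.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        cols = list(subset)
        # The learning algorithm is a black box: train and score it on this subset
        score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(f"Best subset of feature indices: {best_subset}, CV accuracy: {best_score:.3f}")
```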




Discretization

Discretization is the process of converting a continuous attribute into an ordinal attribute
Example: [1, 3, 5, 8, 10, 12] → [Low, Low, Mid, Mid, High, High]
Useful when a model cannot use continuous features
Two types
▶ Unsupervised discretization: class information is not used
⋆ E.g., equal interval, equal frequency, K-means clustering
▶ Supervised discretization: class information is used
⋆ Place the splits to maximize the “purity”



Unsupervised Discretization
Equal frequency vs interval

Data:
1 2 4 4 6 25 30 80 100

3 equal frequency bins (three values per bin):
| 1 2 4 | 4 6 25 | 30 80 100 |

3 equal interval bins of width (100 - 1)/3 = 33:
| 1 2 4 4 6 25 30 | (empty) | 80 100 |
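A small sketch of both unsupervised schemes on the same data using pandas: `qcut` does equal-frequency binning and `cut` does equal-interval binning; the Low/Mid/High labels are reused from the earlier discretization example.

```python
import pandas as pd

values = pd.Series([1, 2, 4, 4, 6, 25, 30, 80, 100])

# Equal frequency: each bin gets roughly the same number of values
# (ties such as the two 4s can make qcut's counts differ slightly from a manual 3/3/3 split)
equal_freq = pd.qcut(values, q=3, labels=["Low", "Mid", "High"])

# Equal interval: each bin spans the same range of values, (100 - 1)/3 = 33
equal_width = pd.cut(values, bins=3, labels=["Low", "Mid", "High"])

print(pd.DataFrame({"value": values,
                    "equal_freq": equal_freq,
                    "equal_width": equal_width}))
```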





Supervised Discretization

Class labels are available

The idea is to place the splits to maximize the “purity”
Purity: a measure of how homogeneous the class labels are in a bin or group
Example: nine objects sorted by attribute value, each labeled T or F (4 T and 5 F in total); two candidate split points are compared on the next slide




Supervised Discretization

Split 1
Left bin has 2T and 4F, i.e. 4/6 ≈ 67% F
Right bin has 2T and 1F, i.e. 2/3 ≈ 67% T

Split 2
Left bin has 2T, i.e. 100% T
Right bin has 2T and 5F, i.e. 5/7 ≈ 71% F

Split 2 produces purer bins (100% T and 71% F) than Split 1 (67% F and 67% T), so a purity-maximizing discretizer prefers Split 2.
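A small sketch of how one might score the two candidate splits, with bins built directly from the counts on the slide; the purity measure used here (majority-class fraction) is one simple choice, and entropy or Gini would work equally well.

```python
from collections import Counter

def purity(labels):
    """Fraction of a bin occupied by its majority class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Bins implied by the counts on the slide (order within a bin does not matter)
split1 = [["T", "T", "F", "F", "F", "F"], ["T", "T", "F"]]
split2 = [["T", "T"], ["T", "T", "F", "F", "F", "F", "F"]]

for name, bins in [("Split 1", split1), ("Split 2", split2)]:
    print(name, [f"{purity(b):.0%}" for b in bins])
# Split 2 has the purer bins, so a purity-maximizing discretizer prefers it
```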



Binarization

Binarization maps a continuous or categorical attribute into one or more binary variables

Continuous attribute example
▶ Data:
1 2 4 4 6 25 30 80 100
▶ 2 equal frequency bins (split at the median):
| 1 2 4 4 6 | 25 30 80 100 |
▶ 2 equal interval bins of width (100 - 1)/2 = 49.5:
| 1 2 4 4 6 25 30 | 80 100 |




Binarization

Categorical attribute example

Encoding the integer value with 3 binary variables (the binary representation of the integer value):

Categorical Value  Integer Value  x1  x2  x3
awful              0              0   0   0
poor               1              0   0   1
OK                 2              0   1   0
good               3              0   1   1
great              4              1   0   0

Encoding with one asymmetric binary variable per value (one-hot):

Categorical Value  Integer Value  x1  x2  x3  x4  x5
awful              0              1   0   0   0   0
poor               1              0   1   0   0   0
OK                 2              0   0   1   0   0
good               3              0   0   0   1   0
great              4              0   0   0   0   1
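A short sketch of both encodings with pandas/numpy: `get_dummies` produces the one-hot (asymmetric binary) encoding, and the three-bit version is derived from the integer codes; the column names are invented for illustration.

```python
import pandas as pd

levels = ["awful", "poor", "OK", "good", "great"]
ratings = pd.Series(levels, dtype=pd.CategoricalDtype(levels, ordered=True))

# One asymmetric binary variable per category (one-hot encoding, 5 variables)
one_hot = pd.get_dummies(ratings, prefix="x")

# Three binary variables holding the binary representation of the integer code
codes = ratings.cat.codes.to_numpy()            # awful=0, poor=1, ..., great=4
bits = pd.DataFrame({f"x{i + 1}": (codes >> (2 - i)) & 1 for i in range(3)})

print(one_hot)
print(bits)
```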

