
CSCI322: Data Analysis

Lecture 5: Data Preprocessing (Part 3)

Dr. Noha Gamal, Dr. Mona Arafa, Dr. Shimaa Mohamed, and Dr. Mustafa Elattar
Outline

● Data Pre-processing: An Overview

○ Data Quality
● Major Tasks in Data Pre-processing

○ Data Cleaning

○ Data Integration

○ Data Reduction

○ Data Transformation and Data Discretization


● Summary



Data Reduction: 2. Attribute Subset Selection
◼ Reduction Focus: This technique reduces data by selecting a subset of
relevant features (attributes) and discarding irrelevant or redundant ones. It
can simplify the model, reduce overfitting, and potentially improve model
performance.

◼ Applied to: Data Features (Attributes).

◼ Popular methods: {Chi-square, Correlation, Tree-based Models, Heuristic selection methods, Recursive feature elimination}

◼ Attribute Subset Selection (Feature Selection) techniques


◼ Decision Tree Induction
◼ Heuristic selection methods



Attribute Subset Selection (Feature Selection):
• Example: In a dataset of online product reviews, you want to identify the most influential features (words or phrases) in predicting product sentiment (positive or negative). You start with a dataset containing various features, such as the frequency of individual words and phrases in each review. The word/phrase columns are the predictor features (independent/explanatory variables/attributes); Sentiment is the response (dependent variable).

Review ID | Excellent | Poor | Quality | Affordable | ... | Sentiment
1         | 3         | 1    | 2       | 2          | ... | Positive
2         | 0         | 5    | 2       | 1          | ... | Negative
3         | 4         | 0    | 3       | 3          | ... | Positive
4         | 0         | 3    | 2       | 2          | ... | Negative
5         | 2         | 2    | 5       | 4          | ... | Positive

To reduce the data through feature selection, you might use techniques like chi-squared tests to identify which features are most relevant for sentiment analysis. To use Chi-Square, construct a contingency table for each feature by projecting the dataset onto that feature and the sentiment label. For example, the contingency table for the "Excellent" feature would look like this:

              | Sentiment = Positive | Sentiment = Negative
Excellent = 0 | 0 (count)            | 2 (count)
Excellent > 0 | 3 (count)            | 0 (count)
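A minimal sketch (not from the lecture) of this chi-square step, assuming SciPy is available; the contingency counts are the ones shown above for the "Excellent" feature.

```python
# Chi-square test on the "Excellent" contingency table from this slide.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Excellent = 0, Excellent > 0; columns: Sentiment = Positive, Negative
contingency = np.array([[0, 2],
                        [3, 0]])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")
# Features whose p-value falls below a chosen threshold (e.g., 0.05) would be kept.
```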



Decision Tree Induction (Supervised method)

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Decision tree learned from the data: the root splits on A4; its branches split on A1 and A6; the leaves are Class 1 / Class 2.)

> Reduced attribute set: {A1, A4, A6}


Decision Tree Induction (Supervised method)

Initial attribute set: {Height, Weight, Age, Job, Marital Status, Number of Kids}, Target {Gender}

(Decision tree learned from the data: the root splits on Height >= 165; its branches split on Weight >= 80 and Weight >= 70; the leaves are Class 1 / Class 2.)

> Reduced attribute set: {Weight, Height}


Example: Let's consider a dataset where we want to predict if a student will pass (1) or fail (0) based on three features: study hours, attendance rate, and student ID. Decision trees inherently perform feature selection by choosing which features to split on based on certain criteria (such as the decrease in Gini impurity or the increase in information gain).

student_id | study_hours | attendance_rate | passed
101        | 5           | 0.8             | 1
102        | 2           | 0.5             | 0
103        | 6           | 0.9             | 1
104        | 3           | 0.6             | 0
105        | 7           | 0.95            | 1

Initial attribute set: {A1, A2, A3} (here: study_hours, attendance_rate, student_id)
• A1 or A2 gives the same purity, so they are redundant with each other.
• A3 (student_id) is irrelevant; splitting on it would only memorize individual students (OVERFITTING).
Selected attribute set: {A1}
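A minimal sketch (assuming scikit-learn is available) of how a decision tree implicitly performs this selection on the toy student table above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "student_id":      [101, 102, 103, 104, 105],
    "study_hours":     [5, 2, 6, 3, 7],
    "attendance_rate": [0.8, 0.5, 0.9, 0.6, 0.95],
    "passed":          [1, 0, 1, 0, 1],
})

X, y = df.drop(columns="passed"), df["passed"]
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# A single split on study_hours or attendance_rate (they are redundant) separates
# pass from fail, so the winning column takes all the importance; the other
# columns, including the irrelevant student_id, get importance 0.
print(dict(zip(X.columns, tree.feature_importances_)))
```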



What if we have ambiguity in the selected feature? Accuracy will decrease.

study_hours | attendance | passed
5           | 5          | 1
2           | 0          | 0
6           | 5          | 1
3           | 1          | 0
5           | 1          | 0
6           | 5          | 1
7           | 5          | 1

Note the ambiguity: study_hours = 5 appears with both passed = 1 and passed = 0.

The splitting condition is applied to the current segment's characteristics:
• Whole segment (4 pass, 3 fail): Gini = 1 - (3/7)^2 - (4/7)^2
• After a Yes/No split on study_hours, one resulting segment has Gini = 1 - (3/4)^2 - (1/4)^2
• Splitting again on the same feature still leaves a segment with Gini = 1 - (1/2)^2 - (1/2)^2; this segment does not represent a pure class.


What if we have ambiguity in the selected feature? Accuracy will decrease. (Same study_hours / attendance / passed table as above, with the same Gini values: 1 - (3/7)^2 - (4/7)^2 at the root, then 1 - (3/4)^2 - (1/4)^2 and 1 - (1/2)^2 - (1/2)^2 after splitting on study_hours alone.)

Whatever we do using the same feature will lead to degraded accuracy, so the solution is to employ another feature that can help in increasing the accuracy, such as the attendance rate, and re-build the model.



Re-building the model with the attendance feature (same table as above; root Gini = 1 - (3/7)^2 - (4/7)^2): the condition Attendance < 5 is applied for splitting the current segment. The Yes branch is labelled Fail and the No branch is labelled Pass, which resolves the ambiguity left by study_hours.
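A small sketch reproducing the Gini impurity values quoted on these slides.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a segment: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini([1, 0, 1, 0, 0, 1, 1]))  # whole segment, 4 pass / 3 fail: ~0.49
print(gini([1, 1, 1, 0]))           # 1 - (3/4)^2 - (1/4)^2 = 0.375
print(gini([1, 0]))                 # the ambiguous segment: 0.5, not a pure class
```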



Heuristic Feature Selection Methods
● Heuristic (empirical/experimental) Feature Selection Methods refer to a set of
techniques designed to select a subset of the most important features from the
original feature set. The goal is to reduce the dataset while retaining as much of the
relevant information as possible. This can lead to simpler models, faster training times,
and sometimes even better performance.
● Considering a dataset with d features, there are 2^d possible subsets of those features (including the empty set and the full set). Exhaustively evaluating all possible subsets is computationally infeasible when d is large. Hence, heuristic methods are used to search the feature space more efficiently.
● Several heuristic feature selection methods:
○ Best single features under the feature independence assumption: choose by
significance tests.
○ Best step-wise feature selection:
■ The best single feature is picked first
■ Then the next best feature conditioned on the first is added, ...
○ Step-wise feature elimination:
■ Repeatedly eliminate the worst feature



Heuristic Feature Selection Methods
1. Best Single Features (under the Feature Independence
Assumption):
1. This method evaluates each feature independently, typically based on its
statistical significance concerning the target variable.
2. Features are ranked based on their individual significance, and a threshold (like
a p-value in hypothesis testing) is used to select the most significant features.
3. The underlying assumption is that the features are independent of each other,
which might not always hold true.
(Remember the first decision tree example.)
2. Best Step-wise Feature Selection:
1. This is a greedy algorithm that starts with no features and adds them one at a
time.
2. Initially, the best single feature is chosen.
3. In the next step, the algorithm selects another feature that, combined with the
first, provides the best performance.
4. This process continues, always selecting the next best feature given the
features already selected, until a stopping criterion is met (like a predefined
number of features or a performance threshold).
(Remember the second decision tree example.)
3. Step-wise Feature Elimination:
1. This is the reverse of the previous method. It starts with all the features.
2. In each step, it eliminates the feature whose removal results in the smallest decrease in performance (or even an increase in performance), i.e., the worst-performing feature.
3. This continues until a stopping criterion is met.
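A minimal sketch of best step-wise (forward) feature selection under assumed names: given a pandas feature DataFrame X, classification labels y, and a scikit-learn style estimator, it greedily adds whichever remaining column most improves the cross-validated score.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features, estimator=None):
    """Greedy best step-wise feature selection (X is a pandas DataFrame)."""
    estimator = estimator or LogisticRegression(max_iter=1000)
    selected, remaining = [], list(X.columns)
    while remaining and len(selected) < n_features:
        # Score each candidate feature together with the ones already chosen.
        scores = {f: cross_val_score(estimator, X[selected + [f]], y, cv=3).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # next best feature given the rest
        selected.append(best)
        remaining.remove(best)
    return selected
```

Step-wise feature elimination is the mirror image: start with all columns and repeatedly drop the one whose removal reduces the cross-validated score the least.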



Data Reduction: 3. Numerosity Reduction
◼ Reduction Focus: Numerosity reduction techniques reduce the volume of data records while preserving key information. This makes it possible to store and visualize a model of the data instead of the whole data set (for example, a regression model). It can involve aggregating data points or using smaller data representations. Is numerosity reduction, like dimensionality reduction and feature selection, mainly applied to the data for better modeling? Not all numerosity reduction techniques are beneficial to the model, but some are, such as sampling.

◼ Applied to: Data Points (Records).


◼ Popular methods: {Regression, Log-Linear Model,….} and {Histograms,
Clustering, Aggregation, Sampling,…..}

◼ Numerosity Reduction techniques


◼ Regression
◼ Histograms analysis, Clustering, and Sampling



3- Numerosity Reduction:
• Example: You are working with a large e-commerce dataset containing individual customer transactions. You want to reduce data volume by grouping transactions made by the same customer within a certain time frame. To reduce data volume through numerosity reduction, you can cluster (group) transactions made by the same customer within a three-day time frame:

Original transactions:
Transaction ID | Customer ID | Amount (USD) | Transaction Date
1              | 1001        | 50           | 2023-01-15
2              | 1002        | 30           | 2023-01-16
3              | 1001        | 60           | 2023-01-18
4              | 1003        | 40           | 2023-01-20
5              | 1002        | 25           | 2023-01-22

Reduced (clustered) records:
Customer ID | Total Amount (USD) | Cluster Start Date | Cluster End Date
1001        | 110                | 2023-01-15         | 2023-01-18
1002        | 55                 | 2023-01-16         | 2023-01-22
1003        | 40                 | 2023-01-20         | 2023-01-20
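A minimal sketch of this reduction with pandas (column names are assumed, not from the slide). For simplicity it groups per customer only; enforcing the three-day window would need an extra step.

```python
import pandas as pd

tx = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id":    [1001, 1002, 1001, 1003, 1002],
    "amount_usd":     [50, 30, 60, 40, 25],
    "date": pd.to_datetime(["2023-01-15", "2023-01-16", "2023-01-18",
                            "2023-01-20", "2023-01-22"]),
})

reduced = tx.groupby("customer_id").agg(
    total_amount_usd=("amount_usd", "sum"),
    cluster_start=("date", "min"),
    cluster_end=("date", "max"),
).reset_index()

print(reduced)   # 3 aggregated records instead of 5 raw transactions
```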



Numerosity Reduction Methods

● In short, numerosity reduction methods reduce data volume by choosing alternative, smaller forms of data representation:
○ Parametric methods (e.g., regression)
■ Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible
outliers)
○ Non-parametric methods
■ Do not assume models
■ Major families: histograms, clustering, sampling, …
○ Parametric methods, like regression, assume a specific model and store
parameters of that model to represent the data.
○ Non-parametric methods, like clustering, don't assume a model and try to
represent the data in a condensed form based on inherent structures or
patterns in the data itself.



Linear Regression

● Example: Regression
● Imagine a dataset with a linear trend. Instead of storing every data point, we can fit a linear regression line to the data and store only the slope and intercept of the line (y = ax + b): we save only a and b, which is why the method is called parametric.
In summary, the idea behind
data point reduction using
regression is to leverage the
trend-capturing ability of
regression models to represent
large datasets with fewer, but
still representative, data points.
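A minimal sketch of this idea with NumPy (the synthetic data is made up for illustration): thousands of noisy points on a linear trend are replaced by the two parameters a and b.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 100, 10_000)
y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=x.size)   # noisy linear data

a, b = np.polyfit(x, y, deg=1)   # store only the slope and intercept
print(a, b)                      # close to 3.0 and 7.0

# Any value can later be approximated as y ≈ a * x + b.
```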



Histogram Analysis
● Divide data into buckets (bins) and store the average (or sum, borders (min-max), …) for each bucket.

● Partitioning rules (consider exam results as an example):

○ Equal-width: equal bucket range

○ Equal-frequency (or equal-depth)



Histogram Analysis
● Divide data into buckets (bins) and store the average (or sum, borders (min-max), …) for each bucket.

● Partitioning rules:

○ Equal-width: equal bucket range

○ Equal-frequency (or equal-depth)

Consider a real-world scenario where a company collects data on the time users spend on their website. If data is collected every second for millions of users, the dataset becomes massive. Instead of storing every data point, the company can use histograms to represent the data in bins of, say, 10 seconds. Now, instead of knowing that 10,000 users spent exactly 31 seconds, 12,000 users spent 32 seconds, and so on, the company can know that 100,000 users spent between 30 and 40 seconds. This is a much more compact representation. It is essential to understand that some detail is lost in this process; that is why histograms are for visualization, not modeling.

From the visualization (figure omitted):
• The equal-width histogram offers a straightforward view of how user time is distributed across fixed time intervals.
• The equal-frequency histogram emphasizes the concentrations and variations in the data, showcasing where most users lie within the time spectrum; for example, the count of users spending 0-15 seconds is equal to the count of users spending 100-300 seconds.



Steps to perform equal-width binning:

1. Sort the Data: Arrange the data in ascending order.


2. Choose Bin Count: Decide the number of bins (k).
3. Calculate Bin Width: Compute the bin width by dividing the data range (values on
the x-axis) by k.
4. Create Bins: Divide the data range into intervals of equal width.
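A minimal sketch of these four steps with NumPy, applied to a sample of the exam scores shown on the following slides, with k = 4 bins.

```python
import numpy as np

scores = np.array([2, 5, 3, 20, 18, 7, 16, 18, 18, 2, 2, 17, 14, 3, 20, 3, 16])  # sample
k = 4                                               # step 2: choose bin count
width = (scores.max() - scores.min()) / k           # step 3: data range / k
edges = scores.min() + width * np.arange(k + 1)     # step 4: equal-width bin edges
counts, _ = np.histogram(scores, bins=edges)
print(edges)    # [ 2.   6.5 11.  15.5 20. ]
print(counts)   # how many scores fall in each bin
```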



Steps to perform equal-depth binning:

1. Sort the Data: Start by sorting the dataset in ascending order.


2. Determine Bin Count: Decide how many bins you want. Let this be k.
3. Calculate Bin Size: Divide the total number of data points by the number of bins k.
Let's call this n. In many cases, n will not be an integer. You might have to adjust the
number of data points in some bins.
4. Create Bins: Use the sorted data to create bins. The first bin will have the first n data
points, the second bin will have the next n data points, and so on.
5. Determine Bin Ranges: The range for each bin will be from the smallest value in the
bin (inclusive) to the smallest value in the next bin (exclusive), except for the last bin
which includes its upper boundary.
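A minimal sketch of equal-depth binning following these steps: sort, then give each of the k bins roughly the same number of points.

```python
import numpy as np

def equal_depth_bins(values, k):
    data = np.sort(np.asarray(values))        # step 1: sort ascending
    chunks = np.array_split(data, k)          # steps 3-4: ~n/k points per bin
    # Step 5: each bin runs from its own first value up to the next bin's first value.
    edges = [int(c[0]) for c in chunks] + [int(data[-1])]
    return chunks, edges

sample = [2, 5, 3, 20, 18, 7, 16, 18, 18, 2, 2, 17, 14, 3, 20, 3]
chunks, edges = equal_depth_bins(sample, k=4)
print([len(c) for c in chunks], edges)   # four bins of 4 values each, plus their edges
```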



scores = [2, 5, 3, 20, 18, 7, 16, 18, 18, 2, 2, 17, 14, 3, 20, 3, 16, 6, 7, 3, 16, 16, 15, 5, 20, 13, 16, 20, 12, 14,
          13, 3, 14, 18, 14, 14, 16, 18, 19, 3, 5, 2, 5, 14, 20, 17, 3, 17, 16, 3, 2, 19, 3, 9, 13, 4, 3, 16, 14, 13,
          13, 16, 20, 14, 4, 2, 3, 18, 7, 3, 5, 3, 6, 9, 18, 3, 16, 18, 20, 18, 5, 18, 5, 18, 13, 14, 19, 13, 14, 3,
          14, 18, 14, 18, 18, 16, 19, 5, 3, 17, 18, 3, 19, 3, 20, 9, 16, 12, 20, 8, 12, 13, 13, 19, 18, 6, 3, 2, 18, 6]



Equal-depth bins: [2, 5[, [5, 14[, [14, 18[, [18, 20]

● 120 elements
Scores.sort() = [2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6,
6, 6, 7, 7, 7, 8, 9, 9, 9, 12, 12, 12, 13,
13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14,
14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 16,
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19,
19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20,
20, 20, 20]



Counts of data samples in each bin of width 1 = [7, 19, 2, 8, 4, 3, 1, 3, …]



Bin edges (for 4 equal-depth bins, each of 30 data points) = [2, 5, 14, 18, 20]



Clustering
● Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
● Can be very effective if data is clustered but not if data is “messy”

● There are many choices of clustering definitions and clustering algorithms that we will explore soon.

Clustering is not considered a parametric data reduction technique because it doesn't assume any underlying parameterized distribution for the data. Instead, it groups data based on similarity, without any predefined model structure. On the other hand, regression is considered parametric because it assumes an underlying form or model (e.g., linear) for the data; parameters of this model (like the slope and intercept in linear regression) are then estimated from the data.

Ex. (figure omitted): using KMeans clustering to group the data into 3 clusters.
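A minimal sketch (assuming scikit-learn and synthetic data) of clustering as numerosity reduction: keep only each cluster's centroid and size instead of all the raw points.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])      # 300 raw 2-D points

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
centroids = km.cluster_centers_                              # 3 representatives
sizes = np.bincount(km.labels_)
print(centroids, sizes)                                      # stored instead of 300 rows
```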



Sampling

● Sampling: obtaining a small sample S to represent the whole data set N

● Allow a data analysis algorithm to run in a reduced complexity that is potentially sub-linear to the size of the data (i.e., < O(N))
● Key principle: Choose a representative subset of the data
○ Simple random sampling may have very poor performance in the presence of a skewed distribution in the original data (skewed means the distribution is unbalanced or asymmetric, which can produce a poor, unrepresentative sample when simple random sampling is applied)
○ Develop adaptive sampling methods, e.g., stratified sampling:



Types of Sampling
● Simple random sampling
○ There is an equal probability of selecting any particular
item
○ Sampling without replacement: once an object is selected, it is removed from the population

○ Sampling with replacement: a selected object is not removed from the population

● Stratified sampling:
○ Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
○ Used in conjunction with skewed data
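A minimal sketch of stratified vs. simple random sampling with pandas, on a deliberately skewed toy label column (names are assumed for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(1000),
    "label": ["rare"] * 50 + ["common"] * 950,   # skewed class distribution
})

# Simple random sample: the small "rare" stratum may be badly represented.
srs = df.sample(frac=0.1, random_state=0)

# Stratified sample: draw 10% from each label group (proportional allocation).
strat = df.groupby("label").sample(frac=0.1, random_state=0)

print(srs["label"].value_counts())
print(strat["label"].value_counts())   # exactly 5 rare and 95 common rows
```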
Sampling: With or without Replacement

(Figure omitted: the same raw data sampled with and without replacement.)
Data Reduction: 4. Data Compression

◼ Reduction Focus: The process of reducing the size of a dataset, typically with the aim to save space or reduce transmission time.
◼ While not traditionally "compression", dimensionality reduction and numerosity reduction may also be considered forms of data compression.



4- Data Compression

In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data.

If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.

(Diagram omitted: Original Data → Compressed Data; lossless reconstruction recovers the original data exactly, while lossy reconstruction recovers only an approximation.)
4- Data Compression
● String compression
○ This is specifically for compressing textual data. There are various algorithms, such as Huffman coding, Run-Length Encoding (RLE), and Lempel-Ziv-Welch (LZW).
○ Typically lossless, but only limited manipulation is possible without expansion.
○ Example using RLE: original string "WWWWWWWWXXZZZ" → compressed string "8W2X3Z".
● Audio/video compression
○ Typically lossy compression, with progressive refinement. Common algorithms include MP3 for audio and MPEG for video.
○ Sometimes small fragments of the signal can be reconstructed without reconstructing the whole thing.
○ Example (YouTube): the video initially loads in a lower quality (blurry) and then gradually increases in quality; this is due to the progressive refinement of lossy compression.
● Time sequences (excluding audio)
○ Typically short and vary slowly with time (e.g., temperature, stock prices).
○ Example: if you have temperature readings such as 12:00 PM: 25°C, 12:01 PM: 25.1°C, 12:02 PM: 25.2°C, ..., then instead of storing every reading you can store the starting value and the rate of change.
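A minimal sketch of Run-Length Encoding matching the string example above.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    # Replace each run of identical characters with "<count><character>".
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

print(rle_encode("WWWWWWWWXXZZZ"))   # -> 8W2X3Z
```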
Outline

● Data Pre-processing: An Overview

○ Data Quality
○ Major Tasks in Data Pre-processing
● Data Cleaning

● Data Integration

● Data Reduction

● Data Transformation and Data Discretization


● Summary



Data Transformation
• Transformation Focus: Data transformation involves converting the data into a
format, structure, or value that makes it more suitable for analysis. It can help in
normalizing scales, handling skewed data, and improving the performance
of certain algorithms that are sensitive to feature scales or distributions.
• Applied to: Data Values.
● Popular methods:
○ Smoothing: Remove noise from data
○ Attribute/feature construction
■ New attributes constructed from the
given ones
○ Aggregation: Summarization, data cube construction
○ Normalization: Scaled to fall within a smaller, specified range
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
○ Discretization: Concept hierarchy climbing
Data Transformation
Example: Consider a dataset of houses with features like area (ranging
from 500 to 5000 sq. ft) and number of rooms (typically ranging from 1 to
5). If you plot the raw data on a graph, the area values will dominate due
to their larger magnitude, potentially causing algorithms to prioritize area
over the number of rooms (feature rank).
From these visualizations, it's evident that the "area" feature has a more
dominant effect on the price due to its larger scale compared to the
"number of rooms" feature. This can lead algorithms to prioritize the
"area" feature more, potentially overshadowing the influence of the
"number of rooms" on the outcome (in this case, price).



Normalization

● Min-max normalization to [new_min_A, new_max_A]: this method scales the data to fall within a specified range, typically [0, 1]

  v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

  ○ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
    ((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716

● Z-score normalization (μ: mean, σ: standard deviation): result has μ ≈ 0, σ ≈ 1

  v' = (v − μ_A) / σ_A

  ○ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

● Normalization by decimal scaling

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
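A minimal sketch of the three normalizations, reproducing the income example (the sample array is made up for illustration).

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization, using the slide's mean and standard deviation
z_score = (income - 54_000) / 16_000

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(min_max)         # 73,600 -> ~0.716
print(z_score)         # 73,600 -> 1.225
print(decimal_scaled)  # j = 5, so 73,600 -> 0.736
```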
Discretization

● Three types of attributes

○ Nominal—values from an unordered set, e.g., color, profession,….


○ Ordinal—values from an ordered set, e.g., military or academic rank
○ Numeric—real numbers, e.g., integer or real numbers
● Discretization: Data discretization is a process used in data
preprocessing to transform continuous data into discrete or categorical
data. Divide the range of a continuous attribute into intervals
○ Interval labels can then be used to replace actual data values
Example: For a continuous attribute, say age which
ranges from 0 to 100, you can divide it into intervals like:
•0-10 (labelled as 'Child')
•11-20 (labelled as 'Teen')
•21-60 (labelled as 'Adult')
•61-100 (labelled as 'Senior')
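A minimal sketch of this age example with pandas.cut (the sample ages are made up).

```python
import pandas as pd

ages = pd.Series([4, 15, 37, 62, 80, 25])
groups = pd.cut(ages,
                bins=[0, 10, 20, 60, 100],
                labels=["Child", "Teen", "Adult", "Senior"])
print(groups.tolist())   # ['Child', 'Teen', 'Adult', 'Senior', 'Senior', 'Adult']
```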
Discretization

● Discretization: Divide the range of a continuous attribute into intervals
○ Interval labels can then be used to replace actual data values
○ Supervised vs. unsupervised (Supervised: the discretization criteria are decided based on the target label; Unsupervised: the criteria depend on the feature values only)
○ Either supervised or not: split (top-down) vs. merge (bottom-up)
○ Discretization can be performed recursively on an attribute

Original data:
Age | infected?
23  | Yes
30  | No
28  | Yes
35  | No
40  | No
22  | Yes

Supervised: you might choose to discretize age into two categories based on the presence of the disease:
• Young (<= 30)
• Middle-Aged (> 30)

Supervised result:
Age         | infected?
Young       | Yes
Middle-aged | No
Young       | Yes
Middle-aged | No
Middle-aged | No
Young       | Yes

Unsupervised result (based on feature values only):
Age | Age (binned)
23  | 20-30
30  | 30-40
28  | 20-30
35  | 30-40
40  | 30-40
22  | 20-30

(In fact, discretization is not handled in this naive way; some more advanced methods are typically applied.)
Data Discretization Methods
Split (top-down): Start with all data in one interval and split it into smaller intervals based on
certain criteria.
Merge (bottom-up): Start with each data point in its own interval and merge them into larger
intervals based on certain criteria.

● Typical methods (all the methods can be applied recursively):
○ Clustering analysis (unsupervised, top-down split or bottom-up merge)
○ Decision-tree analysis (supervised, top-down split)
○ Correlation (e.g., χ²) analysis (unsupervised/supervised, bottom-up merge)
○ Binning
■ Top-down split (bins are decided before
applying to data points), unsupervised
○ Histogram analysis
■ Top-down split, unsupervised
Data Discretization Methods

(Figures omitted.) Using correlation allowed us to create meaningful categories for "Study Hours" based on how they influenced "Exam Score". This is bottom-up: each point was investigated alone and then merged with similar points later.

● Clustering (unsupervised, bottom-up merge)
● Correlation (supervised, bottom-up merge), r = 0.8
Binning

● Equal-width (distance) partitioning


○ Divides the range into N intervals (bin) of equal size: uniform grid
○ if A and B are the lowest and highest values of the attribute, the width of
intervals (bins) will be: W = (B–A)/N.
○ The most straightforward, but outliers may dominate presentation
○ Skewed data is not handled well
● Equal-depth (frequency) partitioning
○ Divides the range into B bins or intervals, each containing approximately the same number of samples; if the number of samples is S, then the expected depth (frequency) will be S/B
○ Good data scaling
○ Managing categorical attributes can be tricky



Binning Methods for Data Smoothing

● Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
● Partition into equal-frequency (equi-depth) bins:
○ Bin 1: 4, 8, 9, 15
○ Bin 2: 21, 21, 24, 25
○ Bin 3: 26, 28, 29, 34
● Smoothing by bin means:
○ Bin 1: 9, 9, 9, 9
○ Bin 2: 23, 23, 23, 23
○ Bin 3: 29, 29, 29, 29
● Smoothing by bin boundaries:
○ Bin 1: 4, 4, 4, 15
○ Bin 2: 21, 21, 25, 25
○ Bin 3: 26, 26, 26, 34
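A minimal sketch reproducing the equal-frequency bins and the smoothing by bin means shown above.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(np.sort(prices), 3)                  # equal-depth bins of 4

smoothed = [np.full(len(b), int(round(b.mean()))) for b in bins]
print(smoothed)   # [array([9, 9, 9, 9]), array([23, 23, 23, 23]), array([29, 29, 29, 29])]
```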
Binning vs. Clustering

(Figure omitted: the same data partitioned by equal interval width (binning), by equal frequency (binning), and by K-means clustering; K-means clustering leads to better results.)
Outline

● Data Pre-processing: An Overview

○ Data Quality
○ Major Tasks in Data Pre-processing
● Data Cleaning

● Data Integration

● Data Reduction

● Data Transformation and Data Discretization


● Summary



Summary

● Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
● Data cleaning: e.g., missing/noisy values, outliers
● Data integration from multiple sources:
○ Entity identification problem
○ Remove redundancies
○ Detect inconsistencies
● Data reduction
○ Dimensionality reduction
○ Numerosity reduction
○ Data compression
● Data transformation and data discretization
○ Normalization
○ Concept hierarchy generation

