
CSCI322: Data Analysis

Lecture 5: Data Preprocessing (Part 3)

Dr. Noha Gamal, Dr. Mona Arafa, Dr. Shimaa Mohamed, and Dr. Mustafa Elattar
Outline

● Data Pre-processing: An Overview

○ Data Quality
● Major Tasks in Data Pre-processing

○ Data Cleaning

○ Data Integration

○ Data Reduction

○ Data Transformation and Data Discretization


● Summary



Data Reduction: 2. Attribute Subset Selection
◼ Reduction Focus: This technique reduces data by selecting a subset of
relevant features (attributes) and discarding irrelevant or redundant ones. It
can simplify the model, reduce overfitting, and potentially improve model
performance.

◼ Applied to: Data Features (Attributes).

◼ Popular methods: {Chi-square, Correlation, Tree-based Models, Heuristic selection methods, Recursive feature elimination}

◼ Attribute Subset Selection (Feature Selection) techniques


◼ Decision Tree Induction
◼ Heuristic selection methods



Attribute Subset Selection (Feature Selection):
• Example: In a dataset of online product reviews, you want to identify the most influential features (words or phrases) in predicting product sentiment (positive or negative). You start with a dataset containing various features, such as the frequency of individual words and phrases in each review. The word/phrase columns are the predictor features (independent/explanatory variables/attributes); Sentiment is the response (dependent variable).

Review ID | Excellent | Poor | Quality | Affordable | ... | Sentiment
1         | 3         | 1    | 2       | 2          | ... | Positive
2         | 0         | 5    | 2       | 1          | ... | Negative
3         | 4         | 0    | 3       | 3          | ... | Positive
4         | 0         | 3    | 2       | 2          | ... | Negative
5         | 2         | 2    | 5       | 4          | ... | Positive

To reduce the data through feature selection, you might use techniques like chi-squared tests to identify which features are most relevant for sentiment analysis. To use Chi-Square, construct a contingency table for each feature by projecting the dataset onto that feature and the sentiment label. For example, the contingency table for the "Excellent" feature would look like this:

              | Sentiment = Positive | Sentiment = Negative
Excellent = 0 | 0 (count)            | 2 (count)
Excellent > 0 | 3 (count)            | 0 (count)
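A minimal sketch (not from the lecture) of this chi-square step, assuming SciPy is available; the contingency counts are the ones shown above for the "Excellent" feature.

```python
# Chi-square test on the "Excellent" contingency table from this slide.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Excellent = 0, Excellent > 0; columns: Sentiment = Positive, Negative
contingency = np.array([[0, 2],
                        [3, 0]])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}")
# Features whose p-value falls below a chosen threshold (e.g., 0.05) would be kept.
```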



Decision Tree Induction (Supervised method)

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Decision tree learned from the data: the root splits on A4; its branches split on A1 and A6; the leaves are Class 1 / Class 2.)

> Reduced attribute set: {A1, A4, A6}


Decision Tree Induction (Supervised method)

Initial attribute set: {Height, Weight, Age, Job, Marital Status, Number of Kids}, Target {Gender}

(Decision tree learned from the data: the root splits on Height >= 165; its branches split on Weight >= 80 and Weight >= 70; the leaves are Class 1 / Class 2.)

> Reduced attribute set: {Weight, Height}


Example: Let's consider a dataset where we want to predict if a student will pass (1) or fail (0) based on three features: study hours, attendance rate, and student ID. Decision trees inherently perform feature selection by choosing which features to split on based on certain criteria (such as the decrease in Gini impurity or the increase in information gain).

student_id | study_hours | attendance_rate | passed
101        | 5           | 0.8             | 1
102        | 2           | 0.5             | 0
103        | 6           | 0.9             | 1
104        | 3           | 0.6             | 0
105        | 7           | 0.95            | 1

Initial attribute set: {A1, A2, A3} (here: study_hours, attendance_rate, student_id)
• A1 or A2 gives the same purity, so they are redundant with each other.
• A3 (student_id) is irrelevant; splitting on it would only memorize individual students (OVERFITTING).
Selected attribute set: {A1}
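A minimal sketch (assuming scikit-learn is available) of how a decision tree implicitly performs this selection on the toy student table above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "student_id":      [101, 102, 103, 104, 105],
    "study_hours":     [5, 2, 6, 3, 7],
    "attendance_rate": [0.8, 0.5, 0.9, 0.6, 0.95],
    "passed":          [1, 0, 1, 0, 1],
})

X, y = df.drop(columns="passed"), df["passed"]
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# A single split on study_hours or attendance_rate (they are redundant) separates
# pass from fail, so the winning column takes all the importance; the other
# columns, including the irrelevant student_id, get importance 0.
print(dict(zip(X.columns, tree.feature_importances_)))
```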



What if we have ambiguity in the selected feature? Accuracy will decrease.

study_hours | attendance | passed
5           | 5          | 1
2           | 0          | 0
6           | 5          | 1
3           | 1          | 0
5           | 1          | 0
6           | 5          | 1
7           | 5          | 1

Note the ambiguity: study_hours = 5 appears with both passed = 1 and passed = 0.

The splitting condition is applied to the current segment's characteristics:
• Whole segment (4 pass, 3 fail): Gini = 1 - (3/7)^2 - (4/7)^2
• After a Yes/No split on study_hours, one resulting segment has Gini = 1 - (3/4)^2 - (1/4)^2
• Splitting again on the same feature still leaves a segment with Gini = 1 - (1/2)^2 - (1/2)^2; this segment does not represent a pure class.


What if we have ambiguity in the selected feature? Accuracy will decrease. (Same study_hours / attendance / passed table as above, with the same Gini values: 1 - (3/7)^2 - (4/7)^2 at the root, then 1 - (3/4)^2 - (1/4)^2 and 1 - (1/2)^2 - (1/2)^2 after splitting on study_hours alone.)

Whatever we do using the same feature will lead to degraded accuracy, so the solution is to employ another feature that can help in increasing the accuracy, such as the attendance rate, and re-build the model.



Re-building the model with the attendance feature (same table as above; root Gini = 1 - (3/7)^2 - (4/7)^2): the condition Attendance < 5 is applied for splitting the current segment. The Yes branch is labelled Fail and the No branch is labelled Pass, which resolves the ambiguity left by study_hours.
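A small sketch reproducing the Gini impurity values quoted on these slides.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a segment: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini([1, 0, 1, 0, 0, 1, 1]))  # whole segment, 4 pass / 3 fail: ~0.49
print(gini([1, 1, 1, 0]))           # 1 - (3/4)^2 - (1/4)^2 = 0.375
print(gini([1, 0]))                 # the ambiguous segment: 0.5, not a pure class
```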



Heuristic Feature Selection Methods
● Heuristic (empirical/experimental) Feature Selection Methods refer to a set of
techniques designed to select a subset of the most important features from the
original feature set. The goal is to reduce the dataset while retaining as much of the
relevant information as possible. This can lead to simpler models, faster training times,
and sometimes even better performance.
● Considering a dataset with d features, there are 2^d possible subsets of those features (including the empty set and the full set). Exhaustively evaluating all possible subsets is computationally infeasible when d is large. Hence, heuristic methods are used to search the feature space more efficiently.
● Several heuristic feature selection methods:
○ Best single features under the feature independence assumption: choose by
significance tests.
○ Best step-wise feature selection:
■ The best single feature is picked first
■ Then the next best feature conditioned on the first is added, ...
○ Step-wise feature elimination:
■ Repeatedly eliminate the worst feature



Heuristic Feature Selection Methods
1. Best Single Features (under the Feature Independence
Assumption):
1. This method evaluates each feature independently, typically based on its
statistical significance concerning the target variable.
2. Features are ranked based on their individual significance, and a threshold (like
a p-value in hypothesis testing) is used to select the most significant features.
3. The underlying assumption is that the features are independent of each other,
which might not always hold true.
(Remember the first decision tree example.)
2. Best Step-wise Feature Selection:
1. This is a greedy algorithm that starts with no features and adds them one at a
time.
2. Initially, the best single feature is chosen.
3. In the next step, the algorithm selects another feature that, combined with the
first, provides the best performance.
4. This process continues, always selecting the next best feature given the
features already selected, until a stopping criterion is met (like a predefined
number of features or a performance threshold).
(Remember the second decision tree example.)
3. Step-wise Feature Elimination:
1. This is the reverse of the previous method. It starts with all the features.
2. In each step, it eliminates the feature whose removal results in the smallest decrease in performance (or even an increase in performance), i.e., the worst-performing feature.
3. This continues until a stopping criterion is met.
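A minimal sketch of best step-wise (forward) feature selection under assumed names: given a pandas feature DataFrame X, classification labels y, and a scikit-learn style estimator, it greedily adds whichever remaining column most improves the cross-validated score.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features, estimator=None):
    """Greedy best step-wise feature selection (X is a pandas DataFrame)."""
    estimator = estimator or LogisticRegression(max_iter=1000)
    selected, remaining = [], list(X.columns)
    while remaining and len(selected) < n_features:
        # Score each candidate feature together with the ones already chosen.
        scores = {f: cross_val_score(estimator, X[selected + [f]], y, cv=3).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # next best feature given the rest
        selected.append(best)
        remaining.remove(best)
    return selected
```

Step-wise feature elimination is the mirror image: start with all columns and repeatedly drop the one whose removal reduces the cross-validated score the least.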



Data Reduction: 3. Numerosity Reduction
◼ Reduction Focus: Numerosity reduction techniques reduce the volume of data records while preserving key information. This makes it possible to store and visualize a model of the data instead of the whole data set (for example, a regression model). It can involve aggregating data points or using smaller data representations. Is numerosity reduction, like dimensionality reduction and feature selection, mainly applied to the data for better modeling? Not all numerosity reduction techniques are beneficial to the model, but some are, such as sampling.

◼ Applied to: Data Points (Records).


◼ Popular methods: {Regression, Log-Linear Model,….} and {Histograms,
Clustering, Aggregation, Sampling,…..}

◼ Numerosity Reduction techniques


◼ Regression
◼ Histograms analysis, Clustering, and Sampling



3- Numerosity Reduction:
• Example: You are working with a large e-commerce dataset containing individual customer transactions. You want to reduce data volume by grouping transactions made by the same customer within a certain time frame. To reduce data volume through numerosity reduction, you can cluster (group) transactions made by the same customer within a three-day time frame:

Original transactions:
Transaction ID | Customer ID | Amount (USD) | Transaction Date
1              | 1001        | 50           | 2023-01-15
2              | 1002        | 30           | 2023-01-16
3              | 1001        | 60           | 2023-01-18
4              | 1003        | 40           | 2023-01-20
5              | 1002        | 25           | 2023-01-22

Reduced (clustered) records:
Customer ID | Total Amount (USD) | Cluster Start Date | Cluster End Date
1001        | 110                | 2023-01-15         | 2023-01-18
1002        | 55                 | 2023-01-16         | 2023-01-22
1003        | 40                 | 2023-01-20         | 2023-01-20
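A minimal sketch of this reduction with pandas (column names are assumed, not from the slide). For simplicity it groups per customer only; enforcing the three-day window would need an extra step.

```python
import pandas as pd

tx = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id":    [1001, 1002, 1001, 1003, 1002],
    "amount_usd":     [50, 30, 60, 40, 25],
    "date": pd.to_datetime(["2023-01-15", "2023-01-16", "2023-01-18",
                            "2023-01-20", "2023-01-22"]),
})

reduced = tx.groupby("customer_id").agg(
    total_amount_usd=("amount_usd", "sum"),
    cluster_start=("date", "min"),
    cluster_end=("date", "max"),
).reset_index()

print(reduced)   # 3 aggregated records instead of 5 raw transactions
```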



Numerosity Reduction Methods

● In short, numerosity reduction methods reduce data volume by choosing alternative, smaller forms of data representation:
○ Parametric methods (e.g., regression)
■ Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible
outliers)
○ Non-parametric methods
■ Do not assume models
■ Major families: histograms, clustering, sampling, …
○ Parametric methods, like regression, assume a specific model and store
parameters of that model to represent the data.
○ Non-parametric methods, like clustering, don't assume a model and try to
represent the data in a condensed form based on inherent structures or
patterns in the data itself.



Linear Regression

● Example: Regression
● Imagine a dataset with a linear trend. Instead of storing every data point, we can fit a linear regression line to the data and store only the slope and intercept of the line (y = ax + b): we save only a and b, which is why the method is called parametric.
In summary, the idea behind
data point reduction using
regression is to leverage the
trend-capturing ability of
regression models to represent
large datasets with fewer, but
still representative, data points.
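A minimal sketch of this idea with NumPy (the synthetic data is made up for illustration): thousands of noisy points on a linear trend are replaced by the two parameters a and b.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 100, 10_000)
y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=x.size)   # noisy linear data

a, b = np.polyfit(x, y, deg=1)   # store only the slope and intercept
print(a, b)                      # close to 3.0 and 7.0

# Any value can later be approximated as y ≈ a * x + b.
```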



Histogram Analysis
● Divide data into buckets (bins) and store the average (or sum, borders (min-max), …) for each bucket.

● Partitioning rules (consider exam results as an example):

○ Equal-width: equal bucket range

○ Equal-frequency (or equal-depth)



Histogram Analysis
● Divide data into buckets (bins) and store the average (or sum, borders (min-max), …) for each bucket.

● Partitioning rules:

○ Equal-width: equal bucket range

○ Equal-frequency (or equal-depth)

Consider a real-world scenario where a company collects data on the time users spend on their website. If data is collected every second for millions of users, the dataset becomes massive. Instead of storing every data point, the company can use histograms to represent the data in bins of, say, 10 seconds. Now, instead of knowing that 10,000 users spent exactly 31 seconds, 12,000 users spent 32 seconds, and so on, the company can know that 100,000 users spent between 30 and 40 seconds. This is a much more compact representation. It is essential to understand that some detail is lost in this process; that is why histograms are for visualization, not modeling.

From the visualization (figure omitted):
• The equal-width histogram offers a straightforward view of how user time is distributed across fixed time intervals.
• The equal-frequency histogram emphasizes the concentrations and variations in the data, showcasing where most users lie within the time spectrum; for example, the count of users spending 0-15 seconds is equal to the count of users spending 100-300 seconds.



Steps to perform equal-width binning:

1. Sort the Data: Arrange the data in ascending order.


2. Choose Bin Count: Decide the number of bins (k).
3. Calculate Bin Width: Compute the bin width by dividing the data range (values on
the x-axis) by k.
4. Create Bins: Divide the data range into intervals of equal width.
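A minimal sketch of these four steps with NumPy, applied to a sample of the exam scores shown on the following slides, with k = 4 bins.

```python
import numpy as np

scores = np.array([2, 5, 3, 20, 18, 7, 16, 18, 18, 2, 2, 17, 14, 3, 20, 3, 16])  # sample
k = 4                                               # step 2: choose bin count
width = (scores.max() - scores.min()) / k           # step 3: data range / k
edges = scores.min() + width * np.arange(k + 1)     # step 4: equal-width bin edges
counts, _ = np.histogram(scores, bins=edges)
print(edges)    # [ 2.   6.5 11.  15.5 20. ]
print(counts)   # how many scores fall in each bin
```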



Steps to perform equal-depth binning:

1. Sort the Data: Start by sorting the dataset in ascending order.


2. Determine Bin Count: Decide how many bins you want. Let this be k.
3. Calculate Bin Size: Divide the total number of data points by the number of bins k.
Let's call this n. In many cases, n will not be an integer. You might have to adjust the
number of data points in some bins.
4. Create Bins: Use the sorted data to create bins. The first bin will have the first n data
points, the second bin will have the next n data points, and so on.
5. Determine Bin Ranges: The range for each bin will be from the smallest value in the
bin (inclusive) to the smallest value in the next bin (exclusive), except for the last bin
which includes its upper boundary.
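A minimal sketch of equal-depth binning following these steps: sort, then give each of the k bins roughly the same number of points.

```python
import numpy as np

def equal_depth_bins(values, k):
    data = np.sort(np.asarray(values))        # step 1: sort ascending
    chunks = np.array_split(data, k)          # steps 3-4: ~n/k points per bin
    # Step 5: each bin runs from its own first value up to the next bin's first value.
    edges = [int(c[0]) for c in chunks] + [int(data[-1])]
    return chunks, edges

sample = [2, 5, 3, 20, 18, 7, 16, 18, 18, 2, 2, 17, 14, 3, 20, 3]
chunks, edges = equal_depth_bins(sample, k=4)
print([len(c) for c in chunks], edges)   # four bins of 4 values each, plus their edges
```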



scores = [2, 5, 3, 20, 18, 7, 16, 18, 18, 2, 2, 17, 14, 3, 20, 3, 16, 6, 7, 3, 16, 16, 15, 5, 20, 13, 16, 20, 12, 14,
          13, 3, 14, 18, 14, 14, 16, 18, 19, 3, 5, 2, 5, 14, 20, 17, 3, 17, 16, 3, 2, 19, 3, 9, 13, 4, 3, 16, 14, 13,
          13, 16, 20, 14, 4, 2, 3, 18, 7, 3, 5, 3, 6, 9, 18, 3, 16, 18, 20, 18, 5, 18, 5, 18, 13, 14, 19, 13, 14, 3,
          14, 18, 14, 18, 18, 16, 19, 5, 3, 17, 18, 3, 19, 3, 20, 9, 16, 12, 20, 8, 12, 13, 13, 19, 18, 6, 3, 2, 18, 6]



Equal-depth bins: [2, 5[, [5, 14[, [14, 18[, [18, 20]

● 120 elements
Scores.sort() = [2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6,
6, 6, 7, 7, 7, 8, 9, 9, 9, 12, 12, 12, 13,
13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14,
14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 16,
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19,
19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20,
20, 20, 20]



Counts of data samples in each bin of width 1 = [7, 19, 2, 8, 4, 3, 1, 3, …]



Bin edges (for 4 equal-depth bins, each of 30 data points) = [2, 5, 14, 18, 20]



Clustering
● Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
● Can be very effective if data is clustered but not if data is “messy”

● There are many choices of clustering definitions and clustering algorithms that we will explore soon.

Clustering is not considered a parametric data reduction technique because it doesn't assume any underlying parameterized distribution for the data. Instead, it groups data based on similarity, without any predefined model structure. On the other hand, regression is considered parametric because it assumes an underlying form or model (e.g., linear) for the data; parameters of this model (like the slope and intercept in linear regression) are then estimated from the data.

Ex. (figure omitted): using KMeans clustering to group the data into 3 clusters.
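A minimal sketch (assuming scikit-learn and synthetic data) of clustering as numerosity reduction: keep only each cluster's centroid and size instead of all the raw points.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])      # 300 raw 2-D points

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
centroids = km.cluster_centers_                              # 3 representatives
sizes = np.bincount(km.labels_)
print(centroids, sizes)                                      # stored instead of 300 rows
```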



Sampling

● Sampling: obtaining a small sample S to represent the whole data set N

● Allow a data analysis algorithm to run in a reduced complexity that is potentially sub-linear to the size of the data (i.e., < O(N))
● Key principle: Choose a representative subset of the data
○ Simple random sampling may have very poor performance in the presence of a skewed distribution in the original data (skewed means the distribution is unbalanced or asymmetric, which can produce a poor, unrepresentative sample when simple random sampling is applied)
○ Develop adaptive sampling methods, e.g., stratified sampling:



Types of Sampling
● Simple random sampling
○ There is an equal probability of selecting any particular
item
○ Sampling without replacement: once an object is selected, it is removed from the population

○ Sampling with replacement: a selected object is not removed from the population

● Stratified sampling:
○ Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
○ Used in conjunction with skewed data
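A minimal sketch of stratified vs. simple random sampling with pandas, on a deliberately skewed toy label column (names are assumed for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(1000),
    "label": ["rare"] * 50 + ["common"] * 950,   # skewed class distribution
})

# Simple random sample: the small "rare" stratum may be badly represented.
srs = df.sample(frac=0.1, random_state=0)

# Stratified sample: draw 10% from each label group (proportional allocation).
strat = df.groupby("label").sample(frac=0.1, random_state=0)

print(srs["label"].value_counts())
print(strat["label"].value_counts())   # exactly 5 rare and 95 common rows
```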
Sampling: With or without Replacement

(Figure omitted: the same raw data sampled with and without replacement.)
Data Reduction: 4. Data Compression

◼ Reduction Focus: The process of reducing the size of a dataset, typically with the aim to save space or reduce transmission time.
◼ While not traditionally "compression", dimensionality reduction and numerosity reduction may also be considered forms of data compression.



4- Data Compression

In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data.

If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.

(Diagram omitted: Original Data → Compressed Data; lossless reconstruction recovers the original data exactly, while lossy reconstruction recovers only an approximation.)
4- Data Compression
● String compression
○ This is specifically for compressing textual data. There are various algorithms, such as Huffman coding, Run-Length Encoding (RLE), and Lempel-Ziv-Welch (LZW).
○ Typically lossless, but only limited manipulation is possible without expansion.
○ Example using RLE: original string "WWWWWWWWXXZZZ" → compressed string "8W2X3Z".
● Audio/video compression
○ Typically lossy compression, with progressive refinement. Common algorithms include MP3 for audio and MPEG for video.
○ Sometimes small fragments of the signal can be reconstructed without reconstructing the whole thing.
○ Example (YouTube): the video initially loads in a lower quality (blurry) and then gradually increases in quality; this is due to the progressive refinement of lossy compression.
● Time sequences (excluding audio)
○ Typically short and vary slowly with time (e.g., temperature, stock prices).
○ Example: if you have temperature readings such as 12:00 PM: 25°C, 12:01 PM: 25.1°C, 12:02 PM: 25.2°C, ..., then instead of storing every reading you can store the starting value and the rate of change.
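A minimal sketch of Run-Length Encoding matching the string example above.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    # Replace each run of identical characters with "<count><character>".
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

print(rle_encode("WWWWWWWWXXZZZ"))   # -> 8W2X3Z
```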
Outline

● Data Pre-processing: An Overview

○ Data Quality
○ Major Tasks in Data Pre-processing
● Data Cleaning

● Data Integration

● Data Reduction

● Data Transformation and Data Discretization


● Summary



Data Transformation
• Transformation Focus: Data transformation involves converting the data into a
format, structure, or value that makes it more suitable for analysis. It can help in
normalizing scales, handling skewed data, and improving the performance
of certain algorithms that are sensitive to feature scales or distributions.
• Applied to: Data Values.
● Popular methods:
○ Smoothing: Remove noise from data
○ Attribute/feature construction
■ New attributes constructed from the
given ones
○ Aggregation: Summarization, data cube construction
○ Normalization: Scaled to fall within a smaller, specified range
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
○ Discretization: Concept hierarchy climbing
Data Transformation
Example: Consider a dataset of houses with features like area (ranging
from 500 to 5000 sq. ft) and number of rooms (typically ranging from 1 to
5). If you plot the raw data on a graph, the area values will dominate due
to their larger magnitude, potentially causing algorithms to prioritize area
over the number of rooms (feature rank).
From these visualizations, it's evident that the "area" feature has a more
dominant effect on the price due to its larger scale compared to the
"number of rooms" feature. This can lead algorithms to prioritize the
"area" feature more, potentially overshadowing the influence of the
"number of rooms" on the outcome (in this case, price).



Normalization

● Min-max normalization to [new_min_A, new_max_A]: this method scales the data to fall within a specified range, typically [0, 1]

  v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

  ○ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
    ((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716

● Z-score normalization (μ: mean, σ: standard deviation): result has μ ≈ 0, σ ≈ 1

  v' = (v − μ_A) / σ_A

  ○ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

● Normalization by decimal scaling

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
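A minimal sketch of the three normalizations, reproducing the income example (the sample array is made up for illustration).

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization, using the slide's mean and standard deviation
z_score = (income - 54_000) / 16_000

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(min_max)         # 73,600 -> ~0.716
print(z_score)         # 73,600 -> 1.225
print(decimal_scaled)  # j = 5, so 73,600 -> 0.736
```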
Discretization

● Three types of attributes

○ Nominal—values from an unordered set, e.g., color, profession,….


○ Ordinal—values from an ordered set, e.g., military or academic rank
○ Numeric—real numbers, e.g., integer or real numbers
● Discretization: Data discretization is a process used in data
preprocessing to transform continuous data into discrete or categorical
data. Divide the range of a continuous attribute into intervals
○ Interval labels can then be used to replace actual data values
Example: For a continuous attribute, say age which
ranges from 0 to 100, you can divide it into intervals like:
•0-10 (labelled as 'Child')
•11-20 (labelled as 'Teen')
•21-60 (labelled as 'Adult')
•61-100 (labelled as 'Senior')
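A minimal sketch of this age example with pandas.cut (the sample ages are made up).

```python
import pandas as pd

ages = pd.Series([4, 15, 37, 62, 80, 25])
groups = pd.cut(ages,
                bins=[0, 10, 20, 60, 100],
                labels=["Child", "Teen", "Adult", "Senior"])
print(groups.tolist())   # ['Child', 'Teen', 'Adult', 'Senior', 'Senior', 'Adult']
```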
Discretization

● Discretization: Divide the range of a continuous attribute into intervals
○ Interval labels can then be used to replace actual data values
○ Supervised vs. unsupervised (Supervised: the discretization criteria are decided based on the target label; Unsupervised: the criteria depend on the feature values only)
○ Either supervised or not: split (top-down) vs. merge (bottom-up)
○ Discretization can be performed recursively on an attribute

Original data:
Age | infected?
23  | Yes
30  | No
28  | Yes
35  | No
40  | No
22  | Yes

Supervised: you might choose to discretize age into two categories based on the presence of the disease:
• Young (<= 30)
• Middle-Aged (> 30)

Supervised result:
Age         | infected?
Young       | Yes
Middle-aged | No
Young       | Yes
Middle-aged | No
Middle-aged | No
Young       | Yes

Unsupervised result (based on feature values only):
Age | Age (binned)
23  | 20-30
30  | 30-40
28  | 20-30
35  | 30-40
40  | 30-40
22  | 20-30

(In fact, discretization is not handled in this naive way; some more advanced methods are typically applied.)
Data Discretization Methods
Split (top-down): Start with all data in one interval and split it into smaller intervals based on
certain criteria.
Merge (bottom-up): Start with each data point in its own interval and merge them into larger
intervals based on certain criteria.

● Typical methods (all the methods can be applied recursively):
○ Clustering analysis (unsupervised, top-down split or bottom-up merge)
○ Decision-tree analysis (supervised, top-down split)
○ Correlation (e.g., χ²) analysis (unsupervised/supervised, bottom-up merge)
○ Binning
■ Top-down split (bins are decided before
applying to data points), unsupervised
○ Histogram analysis
■ Top-down split, unsupervised
Data Discretization Methods

(Figures omitted.) Using correlation allowed us to create meaningful categories for "Study Hours" based on how they influenced "Exam Score". This is bottom-up: each point was investigated alone and then merged with similar points later.

● Clustering (unsupervised, bottom-up merge)
● Correlation (supervised, bottom-up merge), r = 0.8
Binning

● Equal-width (distance) partitioning


○ Divides the range into N intervals (bin) of equal size: uniform grid
○ if A and B are the lowest and highest values of the attribute, the width of
intervals (bins) will be: W = (B–A)/N.
○ The most straightforward, but outliers may dominate presentation
○ Skewed data is not handled well
● Equal-depth (frequency) partitioning
○ Divides the range into B bins or intervals, each containing approximately the same number of samples; if the number of samples is S, then the expected depth (frequency) will be S/B
○ Good data scaling
○ Managing categorical attributes can be tricky



Binning Methods for Data Smoothing

● Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
● Partition into equal-frequency (equi-depth) bins:
○ Bin 1: 4, 8, 9, 15
○ Bin 2: 21, 21, 24, 25
○ Bin 3: 26, 28, 29, 34
● Smoothing by bin means:
○ Bin 1: 9, 9, 9, 9
○ Bin 2: 23, 23, 23, 23
○ Bin 3: 29, 29, 29, 29
● Smoothing by bin boundaries:
○ Bin 1: 4, 4, 4, 15
○ Bin 2: 21, 21, 25, 25
○ Bin 3: 26, 26, 26, 34
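A minimal sketch reproducing the equal-frequency bins and the smoothing by bin means shown above.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(np.sort(prices), 3)                  # equal-depth bins of 4

smoothed = [np.full(len(b), int(round(b.mean()))) for b in bins]
print(smoothed)   # [array([9, 9, 9, 9]), array([23, 23, 23, 23]), array([29, 29, 29, 29])]
```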
Binning vs. Clustering

(Figure omitted: the same data partitioned by equal interval width (binning), by equal frequency (binning), and by K-means clustering; K-means clustering leads to better results.)
Outline

● Data Pre-processing: An Overview

○ Data Quality
○ Major Tasks in Data Pre-processing
● Data Cleaning

● Data Integration

● Data Reduction

● Data Transformation and Data Discretization


● Summary



Summary

● Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
● Data cleaning: e.g., missing/noisy values, outliers
● Data integration from multiple sources:
○ Entity identification problem
○ Remove redundancies
○ Detect inconsistencies
● Data reduction
○ Dimensionality reduction
○ Numerosity reduction
○ Data compression
● Data transformation and data discretization
○ Normalization
○ Concept hierarchy generation

