CS322 Lec5 S25
CSCI322: Data Analysis
Lecture 5: Data Preprocessing (Part 3)
Dr. Noha Gamal, Dr. Mona Arafa, Dr. Shimaa Mohamed, and Dr. Mustafa Elattar
Outline
● Data Quality
● Major Tasks in Data Pre-processing
  ○ Data Cleaning
  ○ Data Integration
  ○ Data Reduction
For example, the contingency table for the "Excellent" feature would look like this:

Review ID | Excellent | Sentiment
1         | 3         | Positive
2         | 0         | Negative
3         | 4         | Positive
4         | 0         | Negative
5         | 2         | Positive

              | Sentiment = Positive | Sentiment = Negative
Excellent = 0 | 0 (count)            | 2 (count)
Excellent > 0 | 3 (count)            | 0 (count)
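From that 2x2 contingency table, a chi-square statistic can be computed to score how strongly the feature is associated with the sentiment label. The helper below is an illustrative sketch (the function name is mine, not from the slides):

```python
def chi_square(table):
    """Chi-square statistic for a 2D contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: Excellent = 0, Excellent > 0; columns: Positive, Negative.
print(chi_square([[0, 2], [3, 0]]))  # chi-square statistic, approximately 5.0
```

A larger statistic means the observed counts deviate more from what independence would predict, so the feature is a stronger candidate to keep.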
[Figure: decision-tree induction for attribute subset selection, with root split A4? and child splits A1? and A6?]
Initial attribute set: {Height, Weight, Age, Job, Marital Status, Number of Kids}; Target: {Gender}
[Figure: decision tree splitting on Height >= 165 (Y/N); the Y branch splits again on Weight >= 80 and the N branch on Weight >= 70. The extra splits are redundant/irrelevant (OVERFITTING).]
Gini = 1 - (3/4)^2 - (1/4)^2 = 0.375
This segment does not represent a pure class:
Gini = 1 - (1/2)^2 - (1/2)^2 = 0.5
Gini = 1 - (1/2)^2 - (1/2)^2 = 0.5
[Figure: decision tree splitting on Attendance < 5; Y branch: Fail, N branch: Pass.]
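The Gini impurity values used in these tree examples can be checked with a small helper (a sketch, not code from the lecture):

```python
def gini(counts):
    """Gini impurity for a list of per-class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([3, 1]))  # 3 of one class, 1 of the other -> 0.375
print(gini([2, 2]))  # evenly mixed segment -> 0.5
print(gini([4, 0]))  # pure segment -> 0.0
```

An impurity of 0 means the segment is pure; 0.5 is the worst case for two classes.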
● Example: Regression
● Imagine a dataset with a linear trend. Instead of storing every data point, we can fit a linear regression line to the data and store only the slope and intercept of the line: y = ax + b, so we save only a and b (that is why the method is called parametric).
In summary, the idea behind
data point reduction using
regression is to leverage the
trend-capturing ability of
regression models to represent
large datasets with fewer, but
still representative, data points.
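As a minimal sketch of this parametric idea (the helper name is illustrative, not from the lecture), a least-squares fit replaces all the points with just the two parameters a and b:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns only the two parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# A perfectly linear toy dataset: y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # the 5 points are reduced to two numbers: 2.0 1.0
```

On real (noisy) data the stored line only approximates the points, which is the usual trade-off of lossy reduction.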
● 120 elements
sorted(scores) = [2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6,
6, 6, 7, 7, 7, 8, 9, 9, 9, 12, 12, 12, 13,
13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14,
14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 16,
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 18,
18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 19,
19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20,
20, 20, 20]
Raw Data
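A histogram reduces those 120 raw values to a handful of bin counts. The sketch below rebuilds the sorted scores from their multiplicities and bins them; the bin width of 5 is an assumption, not stated on the slide:

```python
from collections import Counter

# Rebuild the 120 sorted scores from their multiplicities.
multiplicities = {2: 7, 3: 19, 4: 2, 5: 8, 6: 4, 7: 3, 8: 1, 9: 3,
                  12: 3, 13: 9, 14: 12, 15: 1, 16: 12, 17: 4,
                  18: 17, 19: 6, 20: 9}
scores = [v for v, count in multiplicities.items() for _ in range(count)]

# Equal-width bins of width 5: each score maps to its bin's lower edge.
hist = Counter((s // 5) * 5 for s in scores)
print(sorted(hist.items()))  # e.g. bin 15 covers scores 15-19
```

Instead of 120 numbers we now store 5 (bin, count) pairs, at the cost of losing the exact values inside each bin.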
CS322 – Data Analysis Mustafa Elattar and Noha Gamal 30
Data Reduction: 4. Data Compression
● String compression
  ○ This is specifically for compressing textual data. There are various algorithms, such as Huffman coding, Run-Length Encoding (RLE), and Lempel-Ziv-Welch (LZW).
  ○ Example using RLE: original string "WWWWWWWWXXZZZ" compresses to "8W2X3Z".
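The RLE example above can be reproduced with a few lines of Python (a sketch of the idea, not a production encoder):

```python
from itertools import groupby

def rle_encode(s):
    # Collapse each run of identical characters into "<count><char>".
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(s))

print(rle_encode("WWWWWWWWXXZZZ"))  # -> 8W2X3Z
```

RLE only pays off when the input contains long runs; on text without repetition the "count + character" pairs can make the output longer than the input.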
○ Data Quality
○ Major Tasks in Data Pre-processing
● Data Cleaning
● Data Integration
● Data Reduction
Clustering (unsupervised,
bottom-up merge)
● Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
● Partition into equal-frequency (equi-depth) bins:
○ Bin 1: 4, 8, 9, 15
○ Bin 2: 21, 21, 24, 25
○ Bin 3: 26, 28, 29, 34
● Smoothing by bin means:
○ Bin 1: 9, 9, 9, 9
○ Bin 2: 23, 23, 23, 23
○ Bin 3: 29, 29, 29, 29
● Smoothing by bin boundaries:
○ Bin 1: 4, 4, 4, 15
○ Bin 2: 21, 21, 25, 25
○ Bin 3: 26, 26, 26, 34
Binning vs. Clustering
○ Data Quality
○ Major Tasks in Data Pre-processing
● Data Cleaning
● Data Integration
● Data Reduction