0% found this document useful (0 votes)
5 views4 pages

Example Data mining

The document outlines a dataset with two features, Age and Income, and demonstrates the process of normalizing the data using Min-Max and Z-Score methods. It explains the importance of scaling in distance-based algorithms like KNN and clustering, ensuring that both features contribute equally to distance calculations. The final section provides examples of how scaling affects classification and clustering outcomes.

Uploaded by

Muhammad Waleed
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views4 pages

Example Data mining

The document outlines a dataset with two features, Age and Income, and demonstrates the process of normalizing the data using Min-Max and Z-Score methods. It explains the importance of scaling in distance-based algorithms like KNN and clustering, ensuring that both features contribute equally to distance calculations. The final section provides examples of how scaling affects classification and clustering outcomes.

Uploaded by

Muhammad Waleed
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 4

Example Dataset

Let’s say we have a small dataset with two features: Age and Income.

Perso Ag Income (in


n e thousands)
A 25 50
B 30 60
C 35 80
D 40 100

Step 1: Min-Max Normalization

Goal: Scale the data to a range of [0, 1].

Formula:

Xnormalized=X−XminXmax−XminXnormalized=Xmax−XminX−Xmin
Step-by-Step Calculation:

1. Find Min and Max for Each Feature:


oAge: Xmin=25Xmin=25, Xmax=40Xmax=40
o Income: Xmin=50Xmin=50, Xmax=100Xmax=100
2. Normalize Age:
oFor Person A: 25−2540−25=040−2525−25=0
o For Person B: 30−2540−25=0.3340−2530−25=0.33
o For Person C: 35−2540−25=0.6740−2535−25=0.67
o For Person D: 40−2540−25=140−2540−25=1
3. Normalize Income:
o For Person A: 50−50100−50=0100−5050−50=0
o For Person B: 60−50100−50=0.2100−5060−50=0.2
o For Person C: 80−50100−50=0.6100−5080−50=0.6
o For Person D: 100−50100−50=1100−50100−50=1
4. Normalized Dataset:
Person Age (Normalized) Income (Normalized)
A 0 0
B 0.33 0.2
C 0.67 0.6
D 1 1

Step 2: Z-Score Normalization (Standardization)

Goal: Center the data around 0 with a standard deviation of 1.

Formula:

Xstandardized=X−μσXstandardized=σX−μ
 μμ = mean, σσ = standard deviation.

Step-by-Step Calculation:

1. Calculate Mean (μμ) and Standard Deviation (σσ) for Each


Feature:
o Age:
 Mean: μ=25+30+35+404=32.5μ=425+30+35+40=32.5
 Standard Deviation: σ=6.45σ=6.45
o Income:
 Mean: μ=50+60+80+1004=72.5μ=450+60+80+100=72.5
Standard Deviation: σ=21.02σ=21.02
2. Standardize Age:
o For Person A: 25−32.56.45=−1.166.4525−32.5=−1.16
o For Person B: 30−32.56.45=−0.396.4530−32.5=−0.39
o For Person C: 35−32.56.45=0.396.4535−32.5=0.39
o For Person D: 40−32.56.45=1.166.4540−32.5=1.16
3. Standardize Income:
o For Person A: 50−72.521.02=−1.0721.0250−72.5=−1.07
o For Person B: 60−72.521.02=−0.5921.0260−72.5=−0.59
o For Person C: 80−72.521.02=0.3621.0280−72.5=0.36
o For Person D: 100−72.521.02=1.3121.02100−72.5=1.31
4. Standardized Dataset:
Perso
Age (Standardized) Income (Standardized)
n
A -1.16 -1.07
B -0.39 -0.59
C 0.39 0.36
D 1.16 1.31

Step 3: Impact on Distance-Based Algorithms

Why Scaling Matters:

 Without Scaling:
o Income (range: 50-100) dominates Age (range: 25-40) in
distance calculations.
o Algorithms like KNN and clustering will be biased toward
Income.
 With Scaling:
o Both features contribute equally to distance calculations.
o Improves accuracy and fairness in predictions.

Example: KNN

 Suppose we want to classify a new person with Age = 28 and


Income = 55.
 Using the normalized data, distances will be calculated fairly
between Age and Income.

Example: Clustering

 Clusters will group people based on patterns, not just Income.


 For example, younger people with lower incomes will form a
distinct cluster

You might also like