Working With Data

Uploaded by

Huda Marwan
COE101

Introductory
Artificial Intelligence
College of Engineering

Working with Data

Mohammed Ghazal
Marah Alhalabi
Maha Yaghi
Abdalla Gad
How do they relate?
Artificial intelligence (AI) is the name of a whole field of knowledge, similar to biology or chemistry.
Machine learning (ML) is a part of AI: AI that learns from data and grows with it.
Neural networks (NN) are one type of ML model (with a representation, an optimization, and an evaluation).
Deep learning is a class of NN: a different architecture, or way of connecting the neurons (wiring the brain).
Easier to solve, easier to scale
COE101 Breakout Session 1

https://tinyurl.com/COE101-Breakout1
Outline
What is Data? What are the types of data?
Data Collection
Dealing with Data Problems
 Data Privacy and Protection → Encryption, NDAs, Regulation
 Lack of Motivation and Fatigue → Incentives, Agreements, Crowd Sourcing
 Missing Data → Interpolation, Elimination, New Labeling
 Bias → Detection, Balance Data
 Outliers and Noise → Outliers Detection, Outliers Exclusion, Filtering
 Duplicates → Duplicates Detection and Removal
 Structural Errors → Fixing Capitalization Errors, Fixing Spelling, Merging
 Irrelevant Data → Ethical Elimination
 Poor Data Quality and Reduced Variability → Data Augmentation, Enhancement
 Feature Scale Imbalance, Data Type Issues → Data Transformation
What is Data?
• Data typically presents as a table
• Single row: instance, sample, record, observation
• Single cell in the row: attribute, factor, feature
• A dataset is a collection of rows (instances)
• Datasets are used to train and test AI algorithms
• Comes in the form of tables, media folders, or both
What is Data?

Template
Record/Instance/Sample No.   Feature 1 Name                 Feature 2 Name                 Label Name
Record/Instance/Sample 1     Feature 1 Value for Sample 1   Feature 2 Value for Sample 1   Label Value 1
Record/Instance/Sample 2     Feature 1 Value for Sample 2   Feature 2 Value for Sample 2   Label Value 2

Example
Patient Number (Sample ID)   Weight (Feature 1)   Glucose Level (Feature 2)   Pre-diabetic (Class/Label/Target)
Patient 1                    85                   90                          No
Patient 2                    110                  120                         Yes

What are the types of data?
Numerical Data: measurable data.
Examples: height, weight, price. Can you average? Can you sort?

Categorical Data: can be grouped by a defining characteristic.
Examples: gender, nationality, high-school curriculum. Can you group?

Ordinal Data: a mix of numerical and categorical data. The values can be categories, but their order has meaning.
Examples: rating, educational level, rank. Can you sort?

Time Series Data: data points vs. time.
Examples: stock market indicators, video, profits over time. Is it an f(t)?

Textual Data: words, sentences, or paragraphs.
Examples: social media posts, documents, papers.

Image (Spatial) Data: a matrix or grid of sensor values (light, height, distance).
Examples: MRI, thermal images, mobile phone pictures.
Investigate Your Data
You need to answer a set of basic questions
• How many observations do I have?
• How many features?
• What are the data types of my features?
• Do I have a target/class variable?
• What are the problems in my data?
• What can I do to improve my data?
• Can I produce features from the data?
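
The first few questions above can be answered with a few lines of code. This is a minimal sketch, assuming the dataset is held as a list of Python dictionaries; the rows reuse the two example patients from the earlier slide, and the helper name `summarize` is hypothetical.

```python
# Hypothetical helper: summarize a small table stored as a list of dicts.
patients = [
    {"weight": 85, "glucose_level": 90, "pre_diabetic": "No"},
    {"weight": 110, "glucose_level": 120, "pre_diabetic": "Yes"},
]

def summarize(rows, target):
    """Return the observation count, feature count, and each column's type name."""
    columns = list(rows[0].keys())
    features = [c for c in columns if c != target]
    # Types are read from the first row only; a real check would scan every row.
    types = {c: type(rows[0][c]).__name__ for c in columns}
    return {
        "observations": len(rows),
        "features": len(features),
        "types": types,
        "has_target": target in columns,
    }

summary = summarize(patients, target="pre_diabetic")
print(summary)
```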
Data Collection
1. Collect it yourself
Manual
• Contains far fewer errors
• Takes more time to collect
• More expensive in general
Automatic
• Cheaper
• Gather everything you can find
Data Collection
2. Someone has already collected it for you
• Google’s Dataset Search
• Microsoft Research Open Data
• Amazon Datasets
• UCI Machine Learning Repository
• Government Datasets
Data Collection
The Size and Quality of a Data Set Matters
Better Data → Better AI
“Garbage in, garbage out”
• Your model is only as good as your data
• How do you measure your data set's quality?
• How do you improve data quality?
• How much data do you need to get useful results?
Data Collection
Why is Collecting a Good Dataset Important?

The Google Translate team has more training data than they can use. Rather than tuning their model, the team has earned bigger wins by using the best features in their data.

"...one of our most impactful quality advances since neural machine translation has been in identifying the best subset of our training data to use"
- Software Engineer, Google Translate

"...most of the times when I tried to manually debug interesting-looking errors they could be traced back to issues with the training data."
- Software Engineer, Google Translate

"Interesting-looking" errors are typically caused by the data. Faulty data may cause your model to learn the wrong patterns, regardless of what modeling techniques you try.
Data Collection
Size of the Data Matters
• How much?
 → Minimum: trainable parameters x 10
• Simple AI on good data > Fancy AI on small data
 → A basic artificial intelligence model trained on a large, high-quality dataset will likely perform better than a sophisticated AI model trained on a small, low-quality dataset.
• Google trained simple models on large data sets
• What counts as "a lot" of data? It depends on the project
• Datasets come in a variety of sizes

Data set                        Size (number of examples)
Iris flower data set            150 (total set)
MovieLens (the 20M data set)    20,000,263 (total set)
Google Gmail SmartReply         238,000,000 (training set)
Google Translate                Trillions

Quality of the Data Matters
• A lot of data is of no use if it is bad → quality matters, too.
• A dataset is good if it helps you
• Improve the quality of your data by dealing with its problems
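
The "trainable parameters x 10" rule of thumb can be written as a one-line calculation. This is only a sketch: the function name is hypothetical, and the linear model with three inputs is an assumed illustration, not a model from the slides.

```python
def minimum_examples(trainable_parameters, factor=10):
    """Rule of thumb from the slide: at least `factor` examples per trainable parameter."""
    return trainable_parameters * factor

# Assumed illustration: a linear model with 3 input features has 3 weights + 1 bias.
linear_model_parameters = 3 + 1
print(minimum_examples(linear_model_parameters))  # 40
```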


Dealing with Data Problems
Data Problem: Data Privacy and Protection
• Data privacy protects personal information from unauthorized access, use, or disclosure.
• Important for preventing identity theft, maintaining trust, and complying with regulations.
How to Handle the Problem: Encryption, NDAs, Regulation
• Implement security measures
• Establish clear policies and procedures
• Seek permissions/approvals
Dealing with Data Problems
Data Problem: Lack of Motivation and Fatigue
How to Handle the Problem: Incentives, Agreements, Crowd Sourcing, Quality Checks
Signs of an unmotivated or fatigued respondent:
• Rushing through the survey
• Providing inconsistent answers
• Skimming instructions
• Selecting the same answer for every question
• Providing false or misleading information
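
One of these signs, selecting the same answer for every question, is easy to check automatically. The sketch below is one possible quality check, not a method from the slides; the function name and the survey responses are made up for illustration.

```python
def is_straight_lined(answers):
    """True when a respondent picked the same answer for every question."""
    return len(answers) > 1 and len(set(answers)) == 1

# Made-up survey responses on a 1-5 scale, for illustration only.
responses = {
    "respondent_a": [3, 3, 3, 3, 3],
    "respondent_b": [4, 2, 5, 3, 4],
}

# Flag respondents whose answers should be reviewed before use.
flagged = [name for name, answers in responses.items() if is_straight_lined(answers)]
print(flagged)
```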
Dealing with Data Problems
Data Problem: Missing Data
• Data that is not present or is incomplete in a dataset
• Happens due to:
  • data entry errors
  • sensor malfunction
  • incomplete data collection
Models require complete data to be accurate
If improperly handled, missing data leads to bias and errors

How to Handle the Problem: Interpolation, Elimination, Flag and Fill, New Labeling
Interpolation fits a function to your data and uses it to estimate/extrapolate the missing data.
Simple Interpolation Method: linear interpolation
 → draw a line between close values and use it

Patient 1: (age, heart-rate) = (x1, y1) = (63, 90)
Patient 2: (age, heart-rate) = (x2, y2) = (65, 93)
Patient 3: (age, heart-rate) = (x3, y3) = (67, ?)

\[ y = \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) + y_1 \]

\[ y = \frac{93 - 90}{65 - 63}(x - 63) + 90 \]

\[ y = \frac{93 - 90}{65 - 63}(67 - 63) + 90 = 96 \]
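
The patient example can be reproduced in code. This is a minimal sketch of linear interpolation through two known points; the function name is hypothetical.

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """Estimate y at x from the straight line through (x1, y1) and (x2, y2)."""
    slope = (y2 - y1) / (x2 - x1)
    return slope * (x - x1) + y1

# Patient 3 is 67 years old; patients 1 and 2 supply the two known points.
heart_rate = linear_interpolate(67, x1=63, y1=90, x2=65, y2=93)
print(heart_rate)  # 96.0
```

Because 67 lies outside the range 63 to 65, this estimate is strictly an extrapolation along the same line.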
Dealing with Data Problems
Data Problem: Missing Data
How to Handle the Problem: when you can't or won't estimate the missing value

Missing categorical data → Label it "Missing"
• You’re essentially adding a new class for the feature.
• This tells the algorithm that the value was missing.
• This also gets around the technical requirement for no missing values.

Missing numeric data → Flag and fill
• Flag the observation with an indicator variable of missingness.
• Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.
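
Both strategies can be sketched in a few lines. The example assumes missing cells are stored as `None`; the function names and the two small columns are made up for illustration.

```python
def label_missing(values, label="Missing"):
    """Categorical feature: replace each missing cell with a new class."""
    return [label if v is None else v for v in values]

def flag_and_fill(values, fill=0):
    """Numeric feature: return (filled values, 0/1 missingness indicator)."""
    flags = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags

# Made-up columns for illustration; None marks a missing cell.
curriculum = ["British", None, "American"]
glucose = [90, None, 120]

labeled = label_missing(curriculum)
filled, flags = flag_and_fill(glucose)
print(labeled)
print(filled, flags)
```

Keeping the indicator column next to the filled column lets the model tell a real 0 apart from a filled-in 0.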
Dealing with Data Problems
Data Problem: Bias
• When data used to train AI does not represent the population it serves
• Leads to unfair or discriminatory employment, criminal justice, and healthcare outcomes
How to Handle the Problem: Detection, Balance Data
Balancing data ensures that the data used to train the AI model is representative and diverse → accurate predictions for all
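
Detection can start with simply counting how many records each group has. The sketch below then balances the groups by random oversampling, which is an assumed technique chosen for illustration; the slides do not name a specific balancing method, and the records and function name are made up.

```python
import random
from collections import Counter

def balance_by_oversampling(rows, key, seed=0):
    """Repeat randomly chosen rows of smaller groups until every group matches the largest."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced

# Made-up records: group "A" is over-represented 4 to 1.
rows = [{"group": "A"}] * 4 + [{"group": "B"}]
before = Counter(r["group"] for r in rows)   # detection: count each group
balanced = balance_by_oversampling(rows, key="group")
after = Counter(r["group"] for r in balanced)
print(before, after)
```

Oversampling only repeats existing records, so it adds no new information; collecting more data from the under-represented group is preferable when that is possible.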
Dealing with Data Problems
Data Problem: Outliers and Noise
• Observations lying abnormally far from others
• Due to measurement error, sampling bias, or natural variation in the data
• Lead to inaccurate or biased AI that overemphasizes the influence of the outlier
How to Handle the Problem: Outliers Detection, Outliers Exclusion, Filtering
• Removing outliers can help your model’s performance
• Examine your data carefully to decide whether to remove an outlier
• Never remove an outlier just because it is a "big number." That big number could be very informative
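
One common way to detect candidates for review is the interquartile range (IQR) rule. This is an assumed technique, since the slides do not name a detection method; the readings are the room temperatures from the clipping example later in the deck, and the function name is hypothetical.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Room temperatures (C) from the clipping example in this deck.
temperatures = [23.2, 24.8, 22.1, 23.7, 44.3, 22.5, 14]
outliers = find_outliers(temperatures)
print(outliers)
```

The function only flags values; whether to remove them is still a judgment call made after examining the data.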
Dealing with Data Problems
Data Problem: Duplicates
How to Handle the Problem: Duplicates Detection and Removal
Duplicate observations most frequently arise during data collection, such as when you:
• Combine datasets from multiple places
• Scrape data
• Receive data from clients/other departments
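
Exact duplicates can be detected by remembering which rows have already been seen. This is a minimal sketch, assuming each row is a dictionary of simple values; the rows and the function name are made up for illustration.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows; the (3, "Staff", 14) record appears twice.
rows = [
    {"years": 2, "position": "Staff", "salary": 11},
    {"years": 3, "position": "Staff", "salary": 14},
    {"years": 4, "position": "Staff", "salary": 17},
    {"years": 3, "position": "Staff", "salary": 14},
]
unique_rows = remove_duplicates(rows)
print(unique_rows)
```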
Dealing with Data Problems
Data Problem: Structural Errors
How to Handle the Problem: Fixing Capitalization Errors, Fixing Spelling, Merging
Structural errors arise during measurement, data transfer, or poor housekeeping
Check for:
• Typos
• Inconsistent capitalization
• Mislabeled classes
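
The sketch below fixes capitalization and merges a misspelled class into the intended one. The labels mirror the "staff" and "Supervisr" errors in the activity at the end of this deck; the correction table and the function name are assumptions for illustration.

```python
# Assumed correction table: known misspellings mapped to the intended class.
SPELLING_FIXES = {"supervisr": "supervisor"}

def clean_label(label):
    """Normalize whitespace and capitalization, then merge known misspellings."""
    normalized = label.strip().lower()
    normalized = SPELLING_FIXES.get(normalized, normalized)
    return normalized.capitalize()

positions = ["Staff", "staff", "Supervisr", "Supervisor", "Manager"]
cleaned = [clean_label(p) for p in positions]
classes = sorted(set(cleaned))
print(cleaned)
print(classes)
```

After cleaning, five distinct spellings collapse into the three classes that were actually intended.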
Dealing with Data Problems
Data Problem: Irrelevant Data
• Observations that don’t fit the specific problem
• For example, when building a model for villas only, remove apartments
How to Handle the Problem: Ethical Elimination
• Checking for irrelevant observations before engineering features can save time and effort
• Ethical elimination is a technique used to remove irrelevant data from the dataset while ensuring that the remaining data is ethically sound and representative of the population it serves
• Ethical considerations should also cover whether certain data points may discriminate against or harm certain populations
Dealing with Data Problems
Data Problem: Poor Data Quality and Reduced Variability
Poor data quality
Data that is incomplete, inconsistent, or contains errors
 → leads to inaccurate or biased AI
Reduced variability
Data that is limited in its scope or variety
 → is not representative of the population it serves
How to Handle the Problem: Data Augmentation, Enhancement
Common Augmentation Methods
1. Mirroring
2. Random Cropping
3. Rotation
4. Shearing
5. Color Shifting
6. Brightness

[Figure: Original Image vs. Augmented Image]
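
Three of the augmentation methods (mirroring, rotation, and brightness) can be sketched on a tiny grayscale image stored as a list of rows. The image values and function names are made up for illustration; real projects usually rely on an image library rather than hand-written loops.

```python
def mirror(image):
    """Flip each row left to right."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def adjust_brightness(image, delta):
    """Add delta to every pixel, keeping values within 0-255."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]

# Tiny made-up 2x3 grayscale image.
image = [[10, 20, 30],
         [40, 50, 60]]

mirrored = mirror(image)
rotated = rotate_90(image)
brighter = adjust_brightness(image, 200)
print(mirrored, rotated, brighter)
```

Each function returns a new image, so one original can yield several augmented training samples.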


Dealing with Data Problems
Data Problem: Feature Scale Imbalance, Data Type Issues
• Data type issues: data that is not in a format usable by the AI model → such as categorical data or text data.
• Feature scale imbalance: features that have significantly different scales → impacts the accuracy of the AI model.
How to Handle the Problem: Data Transformation
• One-hot encoding is used in machine learning to quantify categorical data.
• It splits a column of categorical data into several columns, one for each category present in that column. Each new column contains "1" if the row belongs to that category and "0" otherwise.

Before one-hot encoding
Fruit     Categorical Value   Price
Apple     1                   5
Mango     2                   10
Apple     1                   15
Orange    3                   20

After one-hot encoding
Apple   Mango   Orange   Price
1       0       0        5
0       1       0        10
1       0       0        15
0       0       1        20
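
The fruit table can be encoded with a short function. This is a minimal sketch using plain Python lists; the function name is hypothetical, and columns are ordered by first appearance of each category.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per category, in first-seen order."""
    categories = list(dict.fromkeys(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

fruits = ["Apple", "Mango", "Apple", "Orange"]
categories, encoded = one_hot(fruits)
print(categories)
for row in encoded:
    print(row)
```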
Data Transformation

[Figure: data distribution histograms for categorical, ordinal, and numerical data]
Data Transformation

Data Normalization
• Transform data features to be on a similar scale which improves the performance and training
stability of the machine learning model.
• Normalization is useful when your data has varying scales, and the algorithm you are using does not
make assumptions about the distribution of your data.
Data Transformation

Data Normalization
Normalization Techniques
• Linear Scaling
• Log Scaling
• Feature Clipping
• Z-Score

[Figures: Log Scaling; Feature Clipping; Z-Score; Linear Scaling vs. Z-Score]
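
Log scaling is named above without a worked example, so here is a minimal sketch. It assumes the usual definition, replacing each positive value x with log(x), and uses made-up values that span several orders of magnitude.

```python
import math

def log_scale(values):
    """Replace each positive value x with the natural logarithm of x."""
    return [math.log(v) for v in values]

# Made-up values spanning several orders of magnitude.
values = [1, 10, 100, 1000, 10000]
scaled = log_scale(values)
print([round(s, 2) for s in scaled])
```

The raw values range from 1 to 10,000, while the scaled values all fall between 0 and 10, which compresses the long tail.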
Data Transformation

Data Normalization
Linear Scaling
Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0-4 (4 credit cards)
Normalization Needed !!
Data Transformation

Linear Scaling
Age min = 23    Age max = 65    Age max - Age min = 42
Age   Age - Age min   (Age - Age min) / (Age max - Age min)   Age'
35    35-23 = 12      12/42 = 0.29                            0.29
45    45-23 = 22      22/42 = 0.52                            0.52
23    23-23 = 0       0/42 = 0.00                             0.00
55    55-23 = 32      32/42 = 0.76                            0.76
65    65-23 = 42      42/42 = 1.00                            1.00
27    27-23 = 4       4/42 = 0.10                             0.10
33    33-23 = 10      10/42 = 0.24                            0.24


Data Transformation

Linear Scaling
Income min = 7k    Income max = 45k    Income max - Income min = 38k
Income   Income - Income min   (Income - Income min) / (Income max - Income min)   Income'
15,000   8,000                 8,000/38,000                                        0.21
32,000   25,000                25,000/38,000                                       0.66
7,000    0                     0/38,000                                            0.00
45,000   38,000                38,000/38,000                                       1.00
12,000   5,000                 5,000/38,000                                        0.13
20,000   13,000                13,000/38,000                                       0.34
25,000   18,000                18,000/38,000                                       0.47
Data Transformation

Linear Scaling
#Credit Cards min = 0    #Credit Cards max = 4    #Credit Cards max - #Credit Cards min = 4
#Credit Cards   #Credit Cards - #Credit Cards min   (#Credit Cards - #Credit Cards min) / (#Credit Cards max - #Credit Cards min)   #Credit Cards'
1               1                                   1/4                                                                           0.25
3               3                                   3/4                                                                           0.75
4               4                                   4/4                                                                           1.00
3               3                                   3/4                                                                           0.75
0               0                                   0/4                                                                           0.00
2               2                                   2/4                                                                           0.50
2               2                                   2/4                                                                           0.50
Data Transformation

Linear Scaling
Age’   Income’   #Credit Cards’   Buy Insurance
0.29   0.21      0.25             No
0.52   0.66      0.75             No
0.00   0.00      1.00             No
0.76   1.00      0.75             Yes
1.00   0.13      0.00             Yes
0.10   0.34      0.50             No
0.24   0.47      0.50             Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0-4 (4 credit cards)
Normalization Completed !!
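
The three linear-scaling tables can be reproduced with one function. This is a minimal sketch using the insurance data from the slides; the function name is hypothetical, and results are rounded to two decimals to match the tables.

```python
def linear_scale(values):
    """Min-max scaling: (x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [35, 45, 23, 55, 65, 27, 33]
incomes = [15000, 32000, 7000, 45000, 12000, 20000, 25000]
credit_cards = [1, 3, 4, 3, 0, 2, 2]

scaled_ages = [round(v, 2) for v in linear_scale(ages)]
scaled_incomes = [round(v, 2) for v in linear_scale(incomes)]
scaled_cards = [round(v, 2) for v in linear_scale(credit_cards)]
print(scaled_ages)
print(scaled_incomes)
print(scaled_cards)
```

Note that the function divides by zero if every value in a column is identical, so a constant column needs separate handling.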


Data Transformation

Data Normalization
Clipping
Room Temperature (C)   People Inside   Humidity (%)   Cooling Needed
23.2                   30              100            High
24.8                   150             65             Medium
22.1                   20              67             Low
23.7                   30              70             Medium
44.3                   40              80             High
22.5                   20              69             Low
14                     -10             73             Low

Room Temperature = 16-27 degrees
People Inside = 0-50
Humidity = 60-85
Normalization Needed !!
Data Transformation

Data Normalization
Clipping
Room Temperature’ (C)   People Inside’   Humidity’ (%)   Cooling Needed
23.2                    30               100 → 85        High
24.8                    150 → 50         65              Medium
22.1                    20               67              Low
23.7                    30               70              Medium
44.3 → 27.0             40               80              High
22.5                    20               69              Low
14.0 → 16.0             -10 → 0          73              Low

Room Temperature = 16-27 degrees
People Inside = 0-50
Humidity = 60-85
Normalization Needed !!


Data Transformation

Data Normalization
Clipping
Room Temperature’ (C)   People Inside’   Humidity’ (%)   Cooling Needed
23.2                    30               85              High
24.8                    50               65              Medium
22.1                    20               67              Low
23.7                    30               70              Medium
27.0                    40               80              High
22.5                    20               69              Low
16.0                    0                73              Low

Room Temperature = 16-27 degrees
People Inside = 0-50
Humidity = 60-85
Normalization Completed !!
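
The clipping step can be sketched as a function that forces every value into an allowed range. The data and the ranges come from the slides; the function name is hypothetical.

```python
def clip(values, lo, hi):
    """Replace values below lo with lo and values above hi with hi."""
    return [max(lo, min(hi, v)) for v in values]

temperature = [23.2, 24.8, 22.1, 23.7, 44.3, 22.5, 14]
people = [30, 150, 20, 30, 40, 20, -10]
humidity = [100, 65, 67, 70, 80, 69, 73]

clipped_temperature = clip(temperature, 16, 27)
clipped_people = clip(people, 0, 50)
clipped_humidity = clip(humidity, 60, 85)
print(clipped_temperature)
print(clipped_people)
print(clipped_humidity)
```

Unlike outlier removal, clipping keeps every row and only changes the extreme values.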
Data Transformation

Data Normalization
Z-Score
Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0-4 (4 credit cards)
Normalization Needed !!
Data Transformation

Z-Score
Age mean = 40.43    Age std dev = 15.31
Age   Age - Age mean   (Age - Age mean) / Age std dev   Age'
35    -5.43            -0.35                            -0.35
45    4.57             0.30                             0.30
23    -17.43           -1.14                            -1.14
55    14.57            0.95                             0.95
65    24.57            1.61                             1.61
27    -13.43           -0.88                            -0.88
33    -7.43            -0.49                            -0.49

Age’ mean = 0    Age’ std dev = 1
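
The z-score table can be reproduced with Python's standard `statistics` module. The sketch uses the sample standard deviation (dividing by n - 1), which is what gives the 15.31 shown on the slide; the function name is hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the sample std dev."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

ages = [35, 45, 23, 55, 65, 27, 33]
age_mean = round(statistics.mean(ages), 2)
age_std = round(statistics.stdev(ages), 2)
scaled_ages = [round(z, 2) for z in z_scores(ages)]
print(age_mean, age_std)
print(scaled_ages)
```

Using the population standard deviation (`statistics.pstdev`) instead would give slightly larger z-scores for the same data.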
Clean your Data!
Activity
1. Fill the missing Data using Interpolation
2. Remove Duplicate Observations
3. Fix Structural Errors
4. Remove Outliers
5. Apply One Hot Encoding

Years Of Experience   Position     Salary (k AED)
1                     Staff        8
2                     staff        11
3                     Staff        _
4                     Staff        17
3                     Staff        14
6                     Staff        _
7                     Staff        26
7                     Manager      20
8                     Supervisr    30
9                     Supervisor   33
Clean your Data!
Activity
1. Fill the missing Data using Interpolation
2. Remove Duplicate Observations
3. Fix Structural Errors: Supervisr, staff
4. Remove Outliers
5. Apply One Hot Encoding

Years Of Experience   Staff   Supervisor   Manager   Salary (k AED)
1                     1       0            0         8
2                     1       0            0         11
3                     1       0            0         14
4                     1       0            0         17
3                     1       0            0         14
6                     1       0            0         23
7                     1       0            0         26
7                     0       0            1         20
8                     0       1            0         30
9                     0       1            0         33
Data Normalization
Linear Scaling in Excel
Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0-4 (4 credit cards)
Normalization Needed !!
