Working With Data
Introductory
Artificial Intelligence
College of Engineering
Mohammed Ghazal
Marah Alhalabi
Maha Yaghi
Abdalla Gad
How do they relate?
Artificial Intelligence is the name of a whole field of knowledge, similar to biology or chemistry.
Machine Learning is a part of AI: AI that learns from data and grows with it.
A Neural Network is one kind of ML model (it has representation, optimization, and evaluation).
Deep Learning is a class of NN: a different architecture, or way of connecting the neurons (wiring the brain).
Easier to solve, easier to scale
COE101 Breakout Session 1
https://tinyurl.com/COE101-Breakout1
Outline
What is Data? What are the types of data?
Data Collection
Dealing with Data Problems
• Data Privacy and Protection: Encryption, NDAs, Regulation
• Lack of Motivation and Fatigue: Incentives, Agreements, Crowd Sourcing
• Missing Data: Interpolation, Elimination, New Labeling
• Bias: Detection, Balance Data
• Outliers and Noise: Outlier Detection, Outlier Exclusion, Filtering
• Duplicates: Duplicate Detection and Removal
• Structural Errors: Fixing Capitalization Errors, Fixing Spelling, Merging
• Irrelevant Data: Ethical Elimination
• Poor Data Quality and Reduced Variability: Data Augmentation, Enhancement
• Feature Scale Imbalance, Data Type Issues: Data Transformation
What is Data?
• Data typically presents as a table
• Single row: instance, sample, record, observation
• Single cell in the row: attribute, factor, feature
• A dataset is a collection of rows (instances)
• Datasets are used to train and test AI algorithms
• Comes in the form of tables, media folders, or both
What is Data?
Patient 1 85 90 No
What are the types of data?
Ordinal Data: Sits between numerical and categorical data. The values can be categories, but their order has meaning. Test: can you sort them?
Examples: rating, education level, rank.
Image (Spatial Data): A matrix or grid of sensor values (light, height, distance).
Examples: MRI scans, thermal images, mobile phone pictures.
Investigate Your Data
You need to answer a set of basic questions
• How many observations do I have?
• How many features?
• What are the data types of my features?
• Do I have a target/class variable?
• What are the problems in my data?
• What can I do to improve my data?
• Can I produce features from the data?
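The first few questions above can be answered mechanically. The sketch below is only an illustration: the three-row dataset, the column names, and the `summarize` helper are hypothetical and not part of the course material.

```python
# Hypothetical toy dataset: each dict is one observation (row).
# None marks a missing value.
rows = [
    {"age": 35, "income": 15000, "credit_cards": 1, "buy_insurance": "No"},
    {"age": 45, "income": 32000, "credit_cards": 3, "buy_insurance": "No"},
    {"age": 55, "income": None, "credit_cards": 3, "buy_insurance": "Yes"},
]


def summarize(rows, target=None):
    """Count observations and features, and report types and missing values."""
    features = [name for name in rows[0] if name != target]
    types, missing = {}, {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None]
        types[name] = sorted({type(v).__name__ for v in present})
        missing[name] = len(values) - len(present)
    return {
        "n_observations": len(rows),
        "n_features": len(features),
        "has_target": target in rows[0],
        "types": types,
        "missing": missing,
    }


summary = summarize(rows, target="buy_insurance")
for key, value in summary.items():
    print(key, value)
```

The remaining questions (what to improve, which new features to derive) still need human judgment; the summary only tells you where to look.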
Data Collection
1. Collect it yourself
Manual
• Contains far fewer errors
• Takes more time to collect
• More expensive in general
Automatic
• Cheaper
• Gather everything you can find
Data Collection
2. Someone has already collected it for you
• How much? Minimum: trainable parameters × 10
• Quality matters, too. A lot of data is of no use if it is bad.
• Quality: a dataset is good if it helps you
• Simple AI on good data > Fancy AI on small data. This means that a basic artificial intelligence model trained on a large, high-quality dataset will likely perform better than a sophisticated AI model trained on a small, low-quality dataset.
• Improve the quality of your data by dealing with its problems
\[ y = \frac{93 - 90}{65 - 63}\,(x - 63) + 90 \]
\[ y = \frac{93 - 90}{65 - 63}\,(67 - 63) + 90 = 96 \]
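The calculation above uses the straight line through the two known points (63, 90) and (65, 93), evaluated at x = 67. A minimal sketch of that formula follows; the function name `linear_interpolate` is chosen here for illustration.

```python
def linear_interpolate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return slope * (x - x0) + y0


# Same numbers as the worked formula: known points (63, 90) and (65, 93).
y = linear_interpolate(67, 63, 90, 65, 93)
print(y)  # 96.0
```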
Dealing with Data Problems
Data Problem: Missing Data
• Data that is not present or incomplete in a dataset
How to Handle the Problem
• Missing categorical data: label a new class for the feature (for example, "Missing"). You're essentially adding a new category.
• The fix depends on the feature type: Categorical, Ordinal, or Numerical
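For a categorical feature, labeling a new class can be as simple as the sketch below. The label text "Missing", the helper name, and the sample values are assumptions made for illustration.

```python
MISSING_LABEL = "Missing"  # assumed name for the new class


def fill_missing_category(values, label=MISSING_LABEL):
    """Replace absent entries (None or empty string) with a dedicated class."""
    return [label if v is None or v == "" else v for v in values]


positions = ["Staff", None, "Manager", ""]  # hypothetical sample column
filled = fill_missing_category(positions)
print(filled)  # ['Staff', 'Missing', 'Manager', 'Missing']
```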
Data Transformation
Data Normalization
• Transform the data features to a similar scale, which improves the performance and training
stability of the machine learning model.
• Normalization is useful when your data has varying scales, and the algorithm you are using does not
make assumptions about the distribution of your data.
Data Transformation
Data Normalization
Normalization Techniques
• Log Scaling
• Feature Clipping
• Z-Score
• Linear Scaling vs. Z-Score
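Log scaling is named here without a worked example. A minimal sketch, assuming the usual definition x' = log(x) with the natural logarithm and strictly positive inputs; the income values are borrowed from the insurance table used elsewhere in these slides.

```python
import math


def log_scale(values):
    """Apply x' = log(x) to each value. Assumes every value is > 0."""
    if any(v <= 0 for v in values):
        raise ValueError("log scaling needs strictly positive values")
    return [math.log(v) for v in values]


incomes = [15000, 32000, 7000, 45000, 12000, 20000, 25000]
scaled = log_scale(incomes)
print([round(v, 2) for v in scaled])
```

The logarithm compresses large values more than small ones, which is why it tends to be used for features with a long tail.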
Data Transformation
Data Normalization
Linear Scaling
Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes
Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
Normalization needed!
Data Transformation
Linear Scaling
Age min = 23, Age max = 65, Age max - Age min = 42
\[ \text{Age}' = \frac{\text{Age} - \text{Age}_{\min}}{\text{Age}_{\max} - \text{Age}_{\min}} \]
Age   Age - Age min   (Age - Age min) / (Age max - Age min)   Age'
35    35 - 23 = 12    12/42 = 0.29                            0.29
Linear Scaling
Income min = 7k, Income max = 45k, Income max - Income min = 38k
\[ \text{Income}' = \frac{\text{Income} - \text{Income}_{\min}}{\text{Income}_{\max} - \text{Income}_{\min}} \]
Income   Income - Income min   (Income - Income min) / (Income max - Income min)   Income'
15,000   8,000                 8,000/38,000                                        0.21
32,000   25,000                25,000/38,000                                       0.66
7,000    0                     0/38,000                                            0.00
45,000   38,000                38,000/38,000                                       1.00
12,000   5,000                 5,000/38,000                                        0.13
20,000   13,000                13,000/38,000                                       0.34
25,000   18,000                18,000/38,000                                       0.47
Data Transformation
Linear Scaling
#Credit Cards min = 0, #Credit Cards max = 4, #Credit Cards max - #Credit Cards min = 4
\[ \text{\#Credit Cards}' = \frac{\text{\#Credit Cards} - \text{\#Credit Cards}_{\min}}{\text{\#Credit Cards}_{\max} - \text{\#Credit Cards}_{\min}} \]
#Credit Cards   #Credit Cards - min   (#Credit Cards - min) / (max - min)   #Credit Cards'
1               1                     1/4                                   0.25
3               3                     3/4                                   0.75
4               4                     4/4                                   1.00
3               3                     3/4                                   0.75
0               0                     0/4                                   0.00
2               2                     2/4                                   0.50
2               2                     2/4                                   0.50
Data Transformation
Linear Scaling
Age'   Income'   #Credit Cards'   Buy Insurance
0.29   0.21      0.25             No
0.52   0.66      0.75             No
Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
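The linear scaling steps worked through above can be applied to all three columns at once. This is a minimal sketch using the values from the insurance table; the function and variable names are chosen here for illustration.

```python
def linear_scale(values):
    """Map values to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


# Columns of the insurance table, in row order.
ages = [35, 45, 23, 55, 65, 27, 33]
incomes = [15000, 32000, 7000, 45000, 12000, 20000, 25000]
credit_cards = [1, 3, 4, 3, 0, 2, 2]

scaled_ages = [round(v, 2) for v in linear_scale(ages)]
scaled_incomes = [round(v, 2) for v in linear_scale(incomes)]
scaled_cards = [round(v, 2) for v in linear_scale(credit_cards)]

print(scaled_ages)
print(scaled_incomes)  # [0.21, 0.66, 0.0, 1.0, 0.13, 0.34, 0.47]
print(scaled_cards)    # [0.25, 0.75, 1.0, 0.75, 0.0, 0.5, 0.5]
```

After scaling, every feature lies between 0 and 1, so no single feature dominates only because of its units.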
Data Normalization
Clipping
Room Temperature (C)   People Inside   Humidity (%)   Cooling Needed
23.2                   30              100            High
24.8                   150             65             Medium
22.1                   20              67             Low
23.7                   30              70             Medium
44.3                   40              80             High
22.5                   20              69             Low
14                     -10             73             Low
Room Temperature = 16 to 27 degrees
People Inside = 0 to 50
Humidity = 60 to 85
Normalization needed!
Data Transformation
Data Normalization
Clipping
Values outside the allowed range are replaced by the nearest limit (shown as original -> clipped).
Room Temperature' (C)   People Inside'   Humidity' (%)   Cooling Needed
23.2                    30               100 -> 85       High
24.8                    150 -> 50        65              Medium
22.1                    20               67              Low
23.7                    30               70              Medium
44.3 -> 27.0            40               80              High
22.5                    20               69              Low
14 -> 16.0              -10 -> 0         73              Low
Room Temperature = 16 to 27 degrees
People Inside = 0 to 50
Humidity = 60 to 85
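Clipping as used in the room-cooling table can be sketched as follows, using the stated limits (temperature 16 to 27, people 0 to 50, humidity 60 to 85). The `clip` helper name is chosen here for illustration.

```python
def clip(values, lower, upper):
    """Replace anything below lower or above upper with the nearest limit."""
    return [min(max(v, lower), upper) for v in values]


# Columns of the room-cooling table, in row order.
temperature = [23.2, 24.8, 22.1, 23.7, 44.3, 22.5, 14]
people = [30, 150, 20, 30, 40, 20, -10]
humidity = [100, 65, 67, 70, 80, 69, 73]

clipped_temperature = clip(temperature, 16, 27)
clipped_people = clip(people, 0, 50)
clipped_humidity = clip(humidity, 60, 85)

print(clipped_temperature)  # [23.2, 24.8, 22.1, 23.7, 27, 22.5, 16]
print(clipped_people)       # [30, 50, 20, 30, 40, 20, 0]
print(clipped_humidity)     # [85, 65, 67, 70, 80, 69, 73]
```

Clipping keeps every row, unlike outlier exclusion; it only limits how extreme a value is allowed to be.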
Data Transformation
Data Normalization
Z-Score
Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes
Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
Normalization needed!
Data Transformation
Z-Score
Age mean = 40.43, Age std dev = 15.31
\[ \text{Age}' = \frac{\text{Age} - \text{Age}_{\text{mean}}}{\text{Age}_{\text{std dev}}} \]
Age   Age - Age mean   (Age - Age mean) / Age std dev   Age'
35    -5.43            -0.35                            -0.35
45    4.57             0.30                             0.30
23    -17.43           -1.14                            -1.14
55    14.57            0.95                             0.95
65    24.57            1.61                             1.61
27    -13.43           -0.88                            -0.88
33    -7.43            -0.49                            -0.49
After the transformation: Age' mean = 0, Age' std dev = 1
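The Age column above can be reproduced with the standard library. One detail worth noting: the slide's std dev of 15.31 matches the sample standard deviation (dividing by n - 1), so this sketch uses `statistics.stdev` rather than `statistics.pstdev`. The function name `z_scores` is chosen here for illustration.

```python
import statistics


def z_scores(values):
    """Standardize with x' = (x - mean) / std dev (sample std dev, n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


ages = [35, 45, 23, 55, 65, 27, 33]
age_z = z_scores(ages)

print(round(statistics.mean(ages), 2))   # 40.43
print(round(statistics.stdev(ages), 2))  # 15.31
print([round(z, 2) for z in age_z])
# [-0.35, 0.3, -1.14, 0.95, 1.61, -0.88, -0.49]
```

Unlike linear scaling, the result is not confined to [0, 1]; values are expressed as the number of standard deviations away from the mean.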
Clean your Data!
Activity
1. Fill the missing Data using Interpolation
2. Remove Duplicate Observations
3. Fix Structural Errors
4. Remove Outliers
5. Apply One Hot Encoding
Years Of Experience   Position     Salary (k AED)
1                     Staff        8
2                     staff        11
3                     Staff        _
4                     Staff        17
3                     Staff        14
6                     Staff        _
7                     Staff        26
7                     Manager      20
8                     Supervisr    30
9                     Supervisor   33
Clean your Data!
Activity
1. Fill the missing Data using Interpolation
2. Remove Duplicate Observations
3. Fix Structural Errors: Supervisr, staff
4. Remove Outliers
5. Apply One Hot Encoding
Years Of Experience   Staff   Supervisor   Manager   Salary (k AED)
1                     1       0            0         8
2                     1       0            0         11
3                     1       0            0         14
4                     1       0            0         17
3                     1       0            0         14
6                     1       0            0         23
7                     1       0            0         26
7                     0       0            1         20
8                     0       1            0         30
9                     0       1            0         33
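The five activity steps can be chained as in the sketch below. Several choices are assumptions, not the course's stated method: the correction map for the two misspelled labels, interpolating within the same position, and the outlier rule (drop rows whose salary lies more than two residual standard deviations from a straight line fitted to salary against experience). Under these assumptions the duplicate (3, Staff, 14) row and the (7, Manager, 20) row are dropped, which the table above still shows.

```python
import statistics

# Rows from the activity table: (years of experience, position, salary in k AED).
# None marks a missing salary.
raw = [
    (1, "Staff", 8), (2, "staff", 11), (3, "Staff", None), (4, "Staff", 17),
    (3, "Staff", 14), (6, "Staff", None), (7, "Staff", 26), (7, "Manager", 20),
    (8, "Supervisr", 30), (9, "Supervisor", 33),
]

# Assumed correction map for the two structural errors named in the activity.
FIXES = {"staff": "Staff", "supervisr": "Supervisor"}
# Column order taken from the encoded table in the slides.
POSITIONS = ["Staff", "Supervisor", "Manager"]


def fix_structural_errors(rows):
    return [(y, FIXES.get(p.lower(), p), s) for y, p, s in rows]


def fill_by_interpolation(rows):
    """Fill a missing salary from the nearest known rows of the same position."""
    filled = []
    for years, position, salary in rows:
        if salary is None:
            known = [(y, s) for y, p, s in rows if p == position and s is not None]
            # Assumes a known neighbour exists on both sides of the gap.
            x0, y0 = max(pt for pt in known if pt[0] < years)
            x1, y1 = min(pt for pt in known if pt[0] > years)
            salary = y0 + (y1 - y0) / (x1 - x0) * (years - x0)
        filled.append((years, position, salary))
    return filled


def remove_duplicates(rows):
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def remove_outliers(rows, k=2.0):
    """Assumed rule: drop rows more than k residual std devs from a fitted line."""
    xs = [y for y, _, _ in rows]
    ys = [s for _, _, s in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    residuals = [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [row for row, r in zip(rows, residuals) if abs(r) <= k * spread]


def one_hot(rows):
    return [(y, *(int(p == name) for name in POSITIONS), s) for y, p, s in rows]


rows = fix_structural_errors(raw)   # step 3 first, so labels compare cleanly
rows = fill_by_interpolation(rows)  # step 1
rows = remove_duplicates(rows)      # step 2
rows = remove_outliers(rows)        # step 4
encoded = one_hot(rows)             # step 5
for row in encoded:
    print(row)
```

Fixing the labels before the other steps is a design choice: interpolation and duplicate detection both compare positions, so "staff" and "Staff" need to match first.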
Data Normalization
Linear Scaling in Excel
Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes
Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
Normalization needed!