Working With Data

Uploaded by

Huda Marwan

COE101

Introductory
Artificial Intelligence
College of Engineering

Working with Data

Mohammed Ghazal
Marah Alhalabi
Maha Yaghi
Abdalla Gad
How do they relate?
Artificial intelligence is the name of a whole knowledge
field, similar to biology or chemistry
Machine Learning is a part of AI. AI by data. Grows
Neural networks are one kind of ML model (with representation,
optimization, and evaluation)
Deep learning is a class of neural networks: a different architecture, or
way of connecting the neurons (wiring the brain)
Easier to solve, easier to scale
COE101 Breakout Session 1

https://tinyurl.com/COE101-Breakout1
Outline
What is Data? What are the types of data?
Data Collection
Dealing with Data Problems
• Data Privacy and Protection → Encryption, NDAs, Regulation
• Lack of Motivation and Fatigue → Incentives, Agreements, Crowd Sourcing
• Missing Data → Interpolation, Elimination, New Labeling
• Bias → Detection, Balance Data
• Outliers and Noise → Outlier Detection, Outlier Exclusion, Filtering
• Duplicates → Duplicate Detection and Removal
• Structural Errors → Fixing Capitalization Errors, Fixing Spelling, Merging
• Irrelevant Data → Ethical Elimination
• Poor Data Quality and Reduced Variability → Data Augmentation, Enhancement
• Feature Scale Imbalance, Data Type Issues → Data Transformation
What is Data?
• Data typically presents as a table
• Single row: instance, sample, record, observation
• Single cell in the row: attribute, factor, feature
• A dataset is a collection of rows (“instances”)
• Datasets are used to train and test AI algorithms
• Comes in the form of tables, media folders, or both
What is Data?

Template
Record/Instance/Sample No.    Feature 1 Name                  Feature 2 Name                  Label Name
Record/Instance/Sample 1      Feature 1 Value for Sample 1    Feature 2 Value for Sample 1    Label Value 1
Record/Instance/Sample 2      Feature 1 Value for Sample 2    Feature 2 Value for Sample 2    Label Value 2

Example
Patient Number (Sample ID)    Weight (Feature 1)    Glucose Level (Feature 2)    Pre-diabetic (Class/Label/Target)
Patient 1                     85                    90                           No
Patient 2                     110                   120                          Yes

What are the types of data?
Numerical Data: measurable data
Examples: height, weight, price. Can you average? Can you sort?

Categorical Data: can be grouped by a defining characteristic

Examples: gender, nationality, high-school curriculum. Can you group?

Ordinal Data: Mix of numerical and categorical data. Can be categories, but order has meaning.
Examples: rating, educational-level, rank. Can you sort?

Time Series Data: Data Points vs. Time

Examples: Stock market indicators, video, profits over time. Is it an f(t)?

Textual Data: Words, sentences, or paragraphs.

Examples: social media posts, documents, papers.

Image (Spatial Data): Matrix or grid of sensor values (light, height, distance)
Examples: MRI scans, thermal images, mobile phone pictures.
Investigate Your Data
You need to answer a set of basic questions
• How many observations do I have?
• How many features?
• What are the data types of my features?
• Do I have a target/class variable?
• What are the problems in my data?
• What can I do to improve my data?
• Can I produce features from the data?
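The first few questions above can be answered mechanically. The following is a minimal sketch in plain Python, assuming the dataset is held as a list of dictionaries; the column names (patient, weight, glucose, pre_diabetic) are hypothetical and only mirror the patient example from the earlier slide.

```python
# Hypothetical toy dataset modelled on the patient example in these slides.
records = [
    {"patient": "Patient 1", "weight": 85, "glucose": 90, "pre_diabetic": "No"},
    {"patient": "Patient 2", "weight": 110, "glucose": 120, "pre_diabetic": "Yes"},
]
ID_COLUMN = "patient"        # assumed name of the sample-ID column
TARGET = "pre_diabetic"      # assumed name of the target/class column


def summarize(rows, id_column, target):
    """Count observations and features, and report each column's data type."""
    columns = list(rows[0].keys())
    features = [c for c in columns if c not in (id_column, target)]
    types = {c: type(rows[0][c]).__name__ for c in columns}
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    return {
        "observations": len(rows),
        "features": features,
        "types": types,
        "has_target": target in columns,
        "missing_per_column": missing,
    }


summary = summarize(records, ID_COLUMN, TARGET)
print(summary)
```

The remaining questions (what is wrong with the data, how to improve it) still need human judgment; the summary only tells you where to start looking.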
Data Collection
1. Collect it yourself
Manual
• Contains far fewer errors
• Takes more time to collect
• More expensive in general
Automatic
• Cheaper
• Gather everything you can find
Data Collection
2. Someone has already collected it for you
• Google’s Dataset Search
• Microsoft Research Open Data
• Amazon Datasets
• UCI Machine Learning Repository
• Government Datasets
Data Collection
The Size and Quality of a Data Set Matters

Better Data → Better AI

“Garbage in, garbage out”

• Your model is only as good as your data
• How do you measure your data set's quality?
• How do you improve data quality?
• How much data do you need to get useful results?
Data Collection
Why is Collecting a Good Dataset Important?

The Google Translate team has more training data than they can use. Rather than tuning their model, the team has earned bigger wins by using the best features in their data.

"...one of our most impactful quality advances since neural machine translation has been in identifying the best subset of our training data to use"
- Software Engineer, Google Translate

"...most of the times when I tried to manually debug interesting-looking errors they could be traced back to issues with the training data."
- Software Engineer, Google Translate

"Interesting-looking" errors are typically caused by the data. Faulty data may cause your model to learn the wrong patterns, regardless of what modeling techniques you try.
Data Collection
Size of the Data Matters
• How much?
  → Minimum: trainable parameters × 10
• Simple AI on good data > fancy AI on small data
  → A basic AI model trained on a large, high-quality dataset will likely perform better than a sophisticated AI model trained on a small, low-quality dataset.
• Google trained simple models on large data sets
• What counts as "a lot" of data? It depends on the project
• Datasets come in a variety of sizes

Data set                        Size (number of examples)
Iris flower data set            150 (total set)
MovieLens (the 20M data set)    20,000,263 (total set)
Google Gmail SmartReply         238,000,000 (training set)
Google Translate                Trillions

Quality of the Data Matters
• A lot of data is of no use if it is bad → quality matters, too
• Quality dataset is good if it helps you
• Improve the quality of your data by dealing with its problems

Dealing with Data Problems
Data Problem: Data Privacy and Protection
How to Handle the Problem: Encryption, NDAs, Regulation
• Data privacy protects personal information from unauthorized access, use, or disclosure.
• Important for preventing identity theft, maintaining trust, and complying with regulations.
How to handle it
• Implement security measures
• Establish clear policies and procedures
• Seek permissions/approvals
Dealing with Data Problems
Data Problem: Lack of Motivation and Fatigue
How to Handle the Problem: Incentives, Agreements, Crowd Sourcing, Quality Checks
• Rushing through the survey
• Providing inconsistent answers
• Skimming instructions
• Selecting the same answer for every question
• Providing false or misleading information
Dealing with Data Problems
Data Problem: Missing Data
How to Handle the Problem: Interpolation, Elimination, Flag and Fill, New Labeling
• Data that is not present or is incomplete in a dataset
• Happens due to:
  • data entry errors
  • sensor malfunction
  • incomplete data collection
• Models require complete data to be accurate
• If improperly handled, leads to bias and errors

Interpolation fits a function to your data and uses it to estimate/extrapolate the missing data.
Simple interpolation method: linear interpolation → draw a line between close values and use it

Patient 1: (age, heart rate) = (x1, y1) = (63, 90)
Patient 2: (age, heart rate) = (x2, y2) = (65, 93)
Patient 3: (age, heart rate) = (x3, y3) = (67, ?)

$$y = \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) + y_1$$

$$y = \frac{93 - 90}{65 - 63}(x - 63) + 90$$

$$y_3 = \frac{93 - 90}{65 - 63}(67 - 63) + 90 = 96$$
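The linear interpolation formula for the three patients can be checked with a few lines of Python. This is a minimal sketch; the function name linear_interpolate is an illustrative choice, not something defined in the slides.

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """Estimate y at x from the straight line through (x1, y1) and (x2, y2)."""
    slope = (y2 - y1) / (x2 - x1)
    return slope * (x - x1) + y1


# Patient 1: (63, 90), Patient 2: (65, 93), Patient 3: age 67, heart rate unknown.
heart_rate_p3 = linear_interpolate(67, 63, 90, 65, 93)
print(heart_rate_p3)  # 96.0
```

Because age 67 lies outside the 63 to 65 range of the known points, this particular estimate is strictly an extrapolation along the same line.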
Dealing with Data Problems
Data Problem: Missing Data

Missing categorical data → label it "Missing"
• You’re essentially adding a new class for the feature.
• This tells the algorithm that the value was missing.
• This also gets around the technical requirement for no missing values.

Missing numeric data that you can’t or won’t estimate → flag and fill
• Flag the observation with an indicator variable of missingness.
• Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.
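Both strategies for missing values can be sketched in plain Python. The rows and column names below (blood_type, glucose) are hypothetical, and None is assumed to mark a missing value.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"blood_type": "A", "glucose": 90},
    {"blood_type": None, "glucose": None},
    {"blood_type": "O", "glucose": 120},
]


def label_missing(rows, column, label="Missing"):
    """Categorical feature: turn missing values into their own class."""
    return [{**r, column: label if r[column] is None else r[column]} for r in rows]


def flag_and_fill(rows, column, fill_value=0):
    """Numeric feature: add a 0/1 missingness indicator, then fill with 0."""
    result = []
    for r in rows:
        is_missing = r[column] is None
        new_row = dict(r)
        new_row[column + "_missing"] = 1 if is_missing else 0
        new_row[column] = fill_value if is_missing else r[column]
        result.append(new_row)
    return result


cleaned = flag_and_fill(label_missing(rows, "blood_type"), "glucose")
for row in cleaned:
    print(row)
```

The indicator column is what lets a model tell a filled-in 0 apart from a genuine measurement of 0.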
Dealing with Data Problems
Data Problem: Bias
How to Handle the Problem: Detection, Balance Data
• Arises when the data used to train AI does not represent the population it serves
• Leads to unfair or discriminatory employment, criminal justice, and healthcare outcomes
Balancing data
• Ensures the data used to train the AI model is representative and diverse → accurate predictions for all
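One simple way to detect imbalance is to measure each group's share of the dataset, and one simple way to balance it is to oversample the smaller groups. The sketch below assumes both choices; the slides do not prescribe a specific method, and the group labels are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice


def group_shares(rows, key):
    """Detection: fraction of the dataset that falls in each group."""
    counts = Counter(r[key] for r in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def oversample_to_balance(rows, key):
    """Balancing (assumed method): repeat rows until every group matches the largest."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(islice(cycle(members), target))
    return balanced


# Hypothetical dataset: group B is badly under-represented.
applicants = [{"group": "A"}] * 8 + [{"group": "B"}] * 2
shares = group_shares(applicants, "group")
balanced = oversample_to_balance(applicants, "group")
print(shares, group_shares(balanced, "group"))
```

Repeating rows evens out the counts but adds no new information, so collecting more data from the under-represented group is usually the better fix when it is possible.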
Dealing with Data Problems
Data Problem: Outliers and Noise
How to Handle the Problem: Outlier Detection, Outlier Exclusion, Filtering
• Observations lying abnormally far from others
• Due to measurement error, sampling bias, or natural variation in the data
• Lead to inaccurate or biased AI that overemphasizes the influence of the outlier
How to handle it
• Removing outliers helps your model’s performance
• Examine your data carefully to decide whether to remove an outlier
• Never remove an outlier just because it is a "big number." That big number could be very informative
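Outlier detection can be sketched with the interquartile-range (IQR) rule. This rule is an assumption made for illustration, since the slides do not name a detection method. The function only flags candidates, so the decision to remove them stays with the analyst. The values are the room temperatures used later in the clipping example.

```python
import statistics


def flag_outliers_iqr(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


room_temperatures = [23.2, 24.8, 22.1, 23.7, 44.3, 22.5, 14.0]
suspects = flag_outliers_iqr(room_temperatures)
print(suspects)  # candidates to examine, not to delete automatically
```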
Dealing with Data Problems
Data Problem: Duplicates
How to Handle the Problem: Duplicate Detection and Removal
Duplicate observations most frequently arise
during data collection, such as when you:
• Combine datasets from multiple places
• Scrape data
• Receive data from clients/other departments
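Duplicate detection and removal can be sketched as follows, assuming each observation is a dictionary with hashable values and that two observations are duplicates only when every field matches. The rows are hypothetical.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"years": 3, "position": "Staff", "salary": 14},
    {"years": 4, "position": "Staff", "salary": 17},
    {"years": 3, "position": "Staff", "salary": 14},  # exact repeat of the first row
]
unique_rows = drop_duplicates(rows)
print(unique_rows)
```

Near-duplicates, such as the same record with a typo in one field, are not caught by this exact-match check and are better handled after structural errors have been fixed.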
Dealing with Data Problems
Data Problem: Structural Errors
How to Handle the Problem: Fixing Capitalization Errors, Fixing Spelling, Merging
Structural errors arise during measurement, data
transfer, or poor housekeeping
Check for:
• Typos
• Inconsistent capitalization.
• Mislabeled classes
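Fixing capitalization and spelling so that variants merge into one class can be sketched as below. The correction table is hypothetical and would normally be built by inspecting the distinct values in the column; the labels reuse the position names from the activity at the end of this deck.

```python
# Hypothetical typo map, keyed by the lower-cased misspelling.
CORRECTIONS = {"supervisr": "supervisor"}


def clean_label(label):
    """Trim whitespace, unify capitalization, and apply known spelling fixes."""
    key = label.strip().lower()
    key = CORRECTIONS.get(key, key)
    return key.capitalize()


positions = ["Staff", "staff", "Supervisr", "Supervisor", "Manager"]
cleaned_positions = [clean_label(p) for p in positions]
print(cleaned_positions)
```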
Dealing with Data Problems
Data Problem: Irrelevant Data
How to Handle the Problem: Ethical Elimination
• Observations that don’t fit the specific problem
• For example, when building a model for villas only, remove apartments
How to handle it
• Checking for irrelevant observations before engineering features can save time and effort
• Ethical elimination is a technique used to remove irrelevant data from the dataset while ensuring that the data remains ethically sound and representative of the population it serves
• Ethical considerations should also cover whether certain data points may discriminate against or harm certain populations
Dealing with Data Problems
Data Problem: Poor Data Quality and Reduced Variability
How to Handle the Problem: Data Augmentation, Enhancement
Poor data quality
• Data is incomplete, inconsistent, or contains errors → leads to inaccurate or biased AI
Reduced variability
• Data is limited in its scope or variety → is not representative of the population it serves
Dealing with Data Problems
Data Problem: Poor Data Quality and Reduced Variability
How to Handle the Problem: Data Augmentation, Enhancement
Common Augmentation Methods
1. Mirroring
2. Random Cropping
3. Rotation
4. Shearing
5. Color Shifting
6. Brightness

[Figure: original image vs. augmented image]
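Three of these augmentation methods (mirroring, rotation, and brightness) can be sketched on a tiny grayscale image stored as a grid of pixel values. The helper names and the 0 to 255 pixel range are assumptions made for illustration; real projects usually rely on an image library instead.

```python
def mirror(image):
    """Mirroring: flip each row left to right."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotation: turn the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def brighten(image, delta, max_value=255):
    """Brightness: shift every pixel by delta, keeping it within 0..max_value."""
    return [[min(max_value, max(0, p + delta)) for p in row] for row in image]


# Hypothetical 2 x 3 grayscale image.
image = [
    [0, 50, 100],
    [150, 200, 250],
]
augmented = [mirror(image), rotate90(image), brighten(image, 20)]
for variant in augmented:
    print(variant)
```

Each variant is a new training sample derived from the same original, which is how augmentation increases variability without collecting more data.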

Dealing with Data Problems
Data Problem: Feature Scale Imbalance, Data Type Issues
How to Handle the Problem: Data Transformation
• Data type issues: data is not in a format usable by the AI model, such as categorical data or text data.
• Feature scale imbalance: features that have significantly different scales impact the accuracy of the AI model.
• One-hot encoding is used in machine learning to quantify categorical data.
• It splits the column that contains the categorical data (here coded numerically) into many columns, one per category present in that column. Each new column contains “1” if the row belongs to that category and “0” otherwise.

Before one-hot encoding
Fruit     Categorical Value    Price
Apple     1                    5
Mango     2                    10
Apple     1                    15
Orange    3                    20

After one-hot encoding
Apple    Mango    Orange    Price
1        0        0         5
0        1        0         10
1        0        0         15
0        0        1         20
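The fruit table can be one-hot encoded with a short plain-Python sketch. The function name one_hot and the row layout (a list of dictionaries) are illustrative assumptions.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = []
    for r in rows:
        if r[column] not in categories:  # keep first-seen order of categories
            categories.append(r[column])
    encoded = []
    for r in rows:
        new_row = {c: 1 if r[column] == c else 0 for c in categories}
        new_row.update({k: v for k, v in r.items() if k != column})
        encoded.append(new_row)
    return encoded


fruits = [
    {"fruit": "Apple", "price": 5},
    {"fruit": "Mango", "price": 10},
    {"fruit": "Apple", "price": 15},
    {"fruit": "Orange", "price": 20},
]
encoded = one_hot(fruits, "fruit")
for row in encoded:
    print(row)
```

Unlike the numeric codes 1, 2, 3, the 0/1 columns do not suggest that Orange is somehow "larger" than Apple.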
Data Transformation

[Figure: data distribution / histogram for categorical, ordinal, and numerical data]
Data Transformation

Data Normalization
• Transform data features to be on a similar scale which improves the performance and training
stability of the machine learning model.
• Normalization is useful when your data has varying scales, and the algorithm you are using does not
make assumptions about the distribution of your data.
Data Transformation

Data Normalization
Normalization Techniques
[Figure slides: Log Scaling; Feature Clipping; Z-Score; Linear Scaling vs. Z-Score]
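For reference, the four normalization techniques named on these slides are commonly written as follows. The linear scaling and z-score forms agree with the worked examples in this deck; the clipping bounds a and b and the natural-log form of log scaling are the usual textbook definitions, stated here as assumptions because the original figures did not survive.

```latex
\begin{align*}
\text{Linear scaling:} \quad & x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \\
\text{Clipping:}       \quad & x' = \min\bigl(\max(x,\, a),\, b\bigr) \\
\text{Log scaling:}    \quad & x' = \log(x), \quad x > 0 \\
\text{Z-score:}        \quad & x' = \frac{x - \mu}{\sigma}
\end{align*}
```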
Data Transformation

Data Normalization
Linear Scaling
Age    Income    #Credit Cards    Buy Insurance
35     15,000    1                No
45     32,000    3                No
23     7,000     4                No
55     45,000    3                Yes
65     12,000    0                Yes
27     20,000    2                No
33     25,000    2                Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
Normalization needed!
Data Transformation

Linear Scaling
Age min = 23    Age max = 65    Age max - Age min = 42
Age' = (Age - Age min) / (Age max - Age min)

Age    Age - Age min    (Age - Age min) / (Age max - Age min)    Age'
35     35 - 23 = 12     12/42 = 0.29                             0.29
45     45 - 23 = 22     22/42 = 0.52                             0.52
23     23 - 23 = 0      0/42 = 0.00                              0.00
55     55 - 23 = 32     32/42 = 0.76                             0.76
65     65 - 23 = 42     42/42 = 1.00                             1.00
27     27 - 23 = 4      4/42 = 0.10                              0.10
33     33 - 23 = 10     10/42 = 0.24                             0.24

Data Transformation

Linear Scaling
Income min = 7k    Income max = 45k    Income max - Income min = 38k
Income' = (Income - Income min) / (Income max - Income min)

Income    Income - Income min    (Income - Income min) / (Income max - Income min)    Income'
15,000    8,000                  8,000/38,000                                         0.21
32,000    25,000                 25,000/38,000                                        0.66
7,000     0                      0/38,000                                             0.00
45,000    38,000                 38,000/38,000                                        1.00
12,000    5,000                  5,000/38,000                                         0.13
20,000    13,000                 13,000/38,000                                        0.34
25,000    18,000                 18,000/38,000                                        0.47
Data Transformation

Linear Scaling
#Credit Cards min = 0    #Credit Cards max = 4    #Credit Cards max - #Credit Cards min = 4
#Credit Cards' = (#Credit Cards - #Credit Cards min) / (#Credit Cards max - #Credit Cards min)

#Credit Cards    #Credit Cards - min    (#Credit Cards - min) / (max - min)    #Credit Cards'
1                1                      1/4                                    0.25
3                3                      3/4                                    0.75
4                4                      4/4                                    1.00
3                3                      3/4                                    0.75
0                0                      0/4                                    0.00
2                2                      2/4                                    0.50
2                2                      2/4                                    0.50
Data Transformation

Linear Scaling
Age'    Income'    #Credit Cards'    Buy Insurance
0.29    0.21       0.25              No
0.52    0.66       0.75              No
0.00    0.00       1.00              No
0.76    1.00       0.75              Yes
1.00    0.13       0.00              Yes
0.10    0.34       0.50              No
0.24    0.47       0.50              Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
Normalization completed!
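The linear scaling tables can be reproduced with a short Python sketch. The function name linear_scale is an illustrative choice; the input values are the Age, Income, and #Credit Cards columns from the slides.

```python
def linear_scale(values):
    """Map values to the range 0..1 using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        raise ValueError("all values are equal, so linear scaling is undefined")
    return [(v - lo) / span for v in values]


ages = [35, 45, 23, 55, 65, 27, 33]
incomes = [15000, 32000, 7000, 45000, 12000, 20000, 25000]
credit_cards = [1, 3, 4, 3, 0, 2, 2]

scaled_ages = [round(v, 2) for v in linear_scale(ages)]
scaled_incomes = [round(v, 2) for v in linear_scale(incomes)]
scaled_cards = [round(v, 2) for v in linear_scale(credit_cards)]
print(scaled_ages)
print(scaled_incomes)
print(scaled_cards)
```

After scaling, all three features lie between 0 and 1, so income no longer dominates simply because its raw numbers are thousands of times larger.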

Data Transformation

Data Normalization
Clipping
Room Temperature (C)    People Inside    Humidity (%)    Cooling Needed
23.2                    30               100             High
24.8                    150              65              Medium
22.1                    20               67              Low
23.7                    30               70              Medium
44.3                    40               80              High
22.5                    20               69              Low
14                      -10              73              Low

Room Temperature = 16 to 27 degrees
People Inside = 0 to 50
Humidity = 60 to 85
Normalization needed!
Data Transformation

Data Normalization
Clipping
Room Temperature' (C)    People Inside'    Humidity' (%)    Cooling Needed
23.2                     30                100 → 85         High
24.8                     150 → 50          65               Medium
22.1                     20                67               Low
23.7                     30                70               Medium
44.3 → 27.0              40                80               High
22.5                     20                69               Low
14.0 → 16.0              -10 → 0           73               Low

Room Temperature = 16 to 27 degrees
People Inside = 0 to 50
Humidity = 60 to 85
Normalization needed!

Data Transformation

Data Normalization
Clipping
Room Temperature' (C)    People Inside'    Humidity' (%)    Cooling Needed
23.2                     30                85               High
24.8                     50                65               Medium
22.1                     20                67               Low
23.7                     30                70               Medium
27.0                     40                80               High
22.5                     20                69               Low
16.0                     0                 73               Low

Room Temperature = 16 to 27 degrees
People Inside = 0 to 50
Humidity = 60 to 85
Normalization completed!
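The clipping example can be reproduced in a few lines of Python. The function name clip is an illustrative choice; the limits are the allowed ranges stated on the slide.

```python
def clip(values, low, high):
    """Pull every value below low up to low, and every value above high down to high."""
    return [min(max(v, low), high) for v in values]


room_temperature = [23.2, 24.8, 22.1, 23.7, 44.3, 22.5, 14]
people_inside = [30, 150, 20, 30, 40, 20, -10]
humidity = [100, 65, 67, 70, 80, 69, 73]

clipped_temperature = clip(room_temperature, 16, 27)
clipped_people = clip(people_inside, 0, 50)
clipped_humidity = clip(humidity, 60, 85)
print(clipped_temperature)
print(clipped_people)
print(clipped_humidity)
```

Values already inside the allowed range are left unchanged; only the out-of-range readings move to the nearest limit.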
Data Transformation

Data Normalization
Z-Score
Age    Income    #Credit Cards    Buy Insurance
35     15,000    1                No
45     32,000    3                No
23     7,000     4                No
55     45,000    3                Yes
65     12,000    0                Yes
27     20,000    2                No
33     25,000    2                Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
Normalization needed!
Data Transformation

Z-Score
Age mean = 40.43    Age std dev = 15.31
Age' = (Age - Age mean) / Age std dev
Age' mean = 0    Age' std dev = 1

Age    Age - Age mean    (Age - Age mean) / Age std dev    Age'
35     -5.43             -0.35                             -0.35
45     4.57              0.30                              0.30
23     -17.43            -1.14                             -1.14
55     14.57             0.95                              0.95
65     24.57             1.61                              1.61
27     -13.43            -0.88                             -0.88
33     -7.43             -0.49                             -0.49
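The z-score table can be reproduced with Python's standard statistics module. Note one assumption made explicit here: the slide's standard deviation of 15.31 corresponds to the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes.

```python
import statistics


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample std dev."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]


ages = [35, 45, 23, 55, 65, 27, 33]
age_mean = round(statistics.mean(ages), 2)
age_std = round(statistics.stdev(ages), 2)
age_z = [round(z, 2) for z in z_scores(ages)]
print(age_mean, age_std)
print(age_z)
```

The transformed column has mean 0 and standard deviation 1, and unlike linear scaling it is not bounded to the 0 to 1 range.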
Clean your Data!
Activity
1. Fill the missing data using interpolation
2. Remove duplicate observations
3. Fix structural errors
4. Remove outliers
5. Apply one-hot encoding

Years of Experience    Position      Salary (k AED)
1                      Staff         8
2                      staff         11
3                      Staff         _
4                      Staff         17
3                      Staff         14
6                      Staff         _
7                      Staff         26
7                      Manager       20
8                      Supervisr     30
9                      Supervisor    33

Clean your Data!
Activity
1. Fill the missing data using interpolation
2. Remove duplicate observations
3. Fix structural errors: Supervisr, staff
4. Remove outliers
5. Apply one-hot encoding

Years of Experience    Staff    Supervisor    Manager    Salary (k AED)
1                      1        0             0          8
2                      1        0             0          11
3                      1        0             0          14
4                      1        0             0          17
3                      1        0             0          14
6                      1        0             0          23
7                      1        0             0          26
7                      0        0             1          20
8                      0        1             0          30
9                      0        1             0          33
Data Normalization
Linear Scaling in Excel
Age    Income    #Credit Cards    Buy Insurance
35     15,000    1                No
45     32,000    3                No
23     7,000     4                No
55     45,000    3                Yes
65     12,000    0                Yes
27     20,000    2                No
33     25,000    2                Yes

Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0 to 4 (4 credit cards)
Normalization needed!
