Dealing With Missing Values

The document discusses strategies for handling missing values in datasets, emphasizing the importance of addressing these gaps before using machine learning algorithms. It covers techniques such as identifying missing values, replacing them with NaN, removing rows or columns with missing data, and using imputation methods like mean or median to estimate missing values. Additionally, it introduces the ColumnTransformer for applying different transformations to numerical and categorical data separately.

Dealing with Missing Values

• It is not uncommon in real-world applications that some samples are missing one or more values for various reasons. There could be an error in the data collection process, certain fields may have been left blank in a survey, there may have been a power outage, etc.

• Most ML algorithms are unable to process these missing values, so it is important that we deal with all missing values within the data before sending it to the ML algorithm.
Dealing with Missing Values

• Notice the .sum() method will count the number of missing values in each column.

import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
seriesB = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
seriesC = pd.Series(np.random.rand(3), index=['b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})
print(df)
print(df.isnull().sum())

Output:

        one     three       two
a  0.376060       NaN  0.111221
b  0.400020  0.164076  0.548106
c  0.507972  0.325337  0.137571
d       NaN  0.823270  0.816618

one      1
three    1
two      0
dtype: int64
Dealing with Missing Values

• Please note that the call to df.isnull().sum() will only identify the missing values that are specified as NaN.

• Missing values can often be represented using values such as '?', -1, -99 or -999. Missing values such as these will not show up using isnull(). You should be able to identify these in your data exploration stage or when looking for outliers.

• A simple way of dealing with this is to replace the non-standard missing values with NaN.
Dealing with Missing Values

import pandas as pd
import numpy as np

seriesA = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesB = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesC = pd.Series([8, 1, '?', 43], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

print(df)
print(df.isnull().sum())

Output:

  one three two
a  12     8  12
b   4     1   4
c   3     ?   3
d   ?    43   ?

one      0
three    0
two      0
dtype: int64
Dealing with Missing Values

In this case we use replace to replace any occurrence of '?' within the dataframe with NaN.

import pandas as pd
import numpy as np

seriesA = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesB = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesC = pd.Series([8, 1, '?', 43], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

df = df.replace('?', np.NaN)

print(df)
print(df.isnull().sum())

Output:

   one three  two
a   12     8   12
b    4     1    4
c    3   NaN    3
d  NaN    43  NaN

one      1
three    1
two      1
dtype: int64
Dealing with Missing Values

In this case we use replace to replace any occurrence of '?' with NaN within one column of our dataframe (in this case column 'one').

import pandas as pd
import numpy as np

seriesA = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesB = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesC = pd.Series([8, 1, '?', 43], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

df['one'] = df['one'].replace('?', np.NaN)

print(df)
print(df.isnull().sum())

Output:

   one three two
a   12     8  12
b    4     1   4
c    3     ?   3
d  NaN    43   ?

one      1
three    0
two      0
dtype: int64
Dealing with Missing Values

• One of the easiest ways to deal with missing values is to simply remove the corresponding features (columns) or rows from the dataset entirely.

• Rows
  • df.dropna() will remove any rows that contain a missing value.
  • df.dropna(thresh=3) - the parameter thresh specifies the number of non-NA values that a row must have in order to be retained.
  • df.dropna(subset=['A']) will only drop rows where missing values appear in a specific column, in this case column A.

• Columns
  • df.dropna(axis=1) will drop columns that have at least one missing value.
  • If you want to drop a column of a specific name you can call df.drop(['ColumnName'], axis=1).

• Each of the above will return a new DataFrame!
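The subset and column-wise variants listed above can be sketched on a small made-up frame (the column names A, B, C are illustrative, not from the slides):

```python
import pandas as pd
import numpy as np

# Small frame with missing values in columns 'A' and 'B'
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

rows_a = df.dropna(subset=['A'])    # drop rows only where 'A' is missing
no_na_cols = df.dropna(axis=1)      # drop every column containing a NaN
dropped = df.drop(['C'], axis=1)    # drop column 'C' by name

print(rows_a.index.tolist())        # [0, 2]
print(no_na_cols.columns.tolist())  # ['C']
print(dropped.columns.tolist())     # ['A', 'B']
```

Note that df itself is unchanged after all three calls, confirming that each returns a new DataFrame.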


import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
seriesB = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
seriesC = pd.Series(np.random.rand(3), index=['b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

print(df)

newdf = df.dropna()
print(newdf)

Here we drop any rows from our dataframe that contain a missing value.

Output:

        one     three       two
a  0.867059       NaN  0.255192
b  0.722719  0.420534  0.212348
c  0.328197  0.141678  0.237098
d       NaN  0.458063  0.503182

        one     three       two
b  0.722719  0.420534  0.212348
c  0.328197  0.141678  0.237098
import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3))
seriesB = pd.Series(np.random.rand(4))
seriesC = pd.Series(np.random.rand(5))
seriesD = pd.Series(np.random.rand(7))

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC,
                   'four' : seriesD})
print(df)
df = df.dropna(thresh=4, axis=1)
print(df)

Notice above the threshold value is 4. Therefore, a column must have four or more non-NA values in order to be retained.

Output:

       four       one     three       two
0  0.766476  0.909878  0.476737  0.084872
1  0.810370  0.285238  0.386073  0.268438
2  0.567595  0.162616  0.213230  0.389272
3  0.958997       NaN  0.579480  0.689228
4  0.141135       NaN  0.986758       NaN
5  0.612904       NaN       NaN       NaN
6  0.193091       NaN       NaN       NaN

       four     three       two
0  0.766476  0.476737  0.084872
1  0.810370  0.386073  0.268438
2  0.567595  0.213230  0.389272
3  0.958997  0.579480  0.689228
4  0.141135  0.986758       NaN
5  0.612904       NaN       NaN
6  0.193091       NaN       NaN
Dealing with Missing Values

• Although the removal of missing values may seem convenient, it also comes with significant disadvantages.

• If you delete a row that has a missing value then you could potentially be deleting useful feature information.

• In many cases you may not have an abundance of data and may end up removing many samples, which will reduce the accuracy of our model.

• It is typically preferable to use imputation techniques to estimate the missing values from the other training samples in our dataset.
Dealing with Missing Values

• One of the most common imputation techniques is mean imputation, where we replace a missing value with the mean of the data items in that column.

• Scikit-learn provides an easy way of applying this method by using the SimpleImputer class.

• When creating a SimpleImputer object we can specify the strategy used for imputing missing values. The most typical is strategy = 'mean', however you can also impute by specifying strategy = 'median', strategy = 'most_frequent' (mode) or strategy = 'constant'.

• The most_frequent strategy replaces the missing value with the most frequent value in that column. This is a strategy that is useful for categorical features.

• Another feature of the SimpleImputer is that we can specify the missing value using the parameter missing_values.
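Since the later slides only demonstrate the mean strategy, here is a minimal sketch of most_frequent on a categorical column (the 'Grade' values below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A categorical column in which 'B' is the most frequent value
df = pd.DataFrame({'Grade': ['A', 'B', 'B', np.nan, 'B', 'C']})

# Unlike mean/median, most_frequent also works on string data
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imputer.fit_transform(df[['Grade']])

print(filled.ravel())  # the NaN is replaced by 'B'
```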
from sklearn.impute import SimpleImputer
# note we are using the dataframe we defined at the start of this section

print(df)

imputer = SimpleImputer(missing_values = np.NaN, strategy='mean')

imputer.fit(df)

allValues = imputer.transform(df)

print(allValues)

In this case we can impute for an entire dataset by passing in a pandas dataframe (we could also pass in a NumPy array). The array we pass in must be an exclusively numerical array. The transform operation will always return a NumPy array.

Output:

        one     three       two
a  0.030666       NaN  0.921680
b  0.351147  0.740355  0.478344
c  0.449259  0.299911  0.937952
d       NaN  0.897354  0.168368

[[ 0.03066623  0.64587346  0.92168033]
 [ 0.3511469   0.7403552   0.47834361]
 [ 0.4492592   0.29991101  0.93795242]
 [ 0.27702411  0.89735416  0.16836778]]
from sklearn.impute import SimpleImputer
# note we are using the dataframe we defined at the start of this section

imputer = SimpleImputer(missing_values = np.NaN, strategy='mean')

imputer.fit(df)

allValues = imputer.transform(df)

print(allValues)

df = pd.DataFrame(data = allValues, columns = df.columns)

print(df)

Notice that above we convert the NumPy array back into a Pandas dataframe using pd.DataFrame.

Output:

[[ 0.13159694  0.57903764  0.39386208]
 [ 0.73002787  0.92954245  0.16350405]
 [ 0.67807006  0.58945908  0.60437333]
 [ 0.51323162  0.21811138  0.91419672]]

        one     three       two
0  0.131597  0.579038  0.393862
1  0.730028  0.929542  0.163504
2  0.678070  0.589459  0.604373
3  0.513232  0.218111  0.914197
from sklearn.impute import SimpleImputer
# note we are using the dataframe we defined at the start of this section

print(df)

imputer = SimpleImputer(missing_values = np.NaN, strategy='mean')

imputer.fit( df[['one']] )

df['one'] = imputer.transform( df[['one']] )

print(df)

In the example above we impute for just one specific column within our dataframe.

Output:

        one     three       two
a  0.068504       NaN  0.265894
b  0.323957  0.004214  0.151085
c  0.311528  0.646061  0.912238
d       NaN  0.642776  0.467234

        one     three       two
a  0.068504       NaN  0.265894
b  0.323957  0.004214  0.151085
c  0.311528  0.646061  0.912238
d  0.234663  0.642776  0.467234
Using ColumnTransformer

• ColumnTransformer is a class that allows different columns of the input to be transformed separately, with the results combined into a single feature space. This is useful when dealing with datasets that contain heterogeneous data.

• Let's illustrate using a slightly different dataset. You will notice there is a missing value in the Age column. In this example we are going to use ColumnTransformer to perform different transformations on the numerical data and the categorical data and merge the result.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

seriesA = pd.Series(['A', 'C', 'B', 'B'])
seriesB = pd.Series([21, 18])
seriesC = pd.Series([4, 1, 1])
seriesD = pd.Series(['Computing', 'Biology', 'Computing'])

df = pd.DataFrame({'Grade' : seriesA, 'Age' : seriesB, 'DegreeYear' : seriesC,
                   'Department' : seriesD})
print(df)
Using ColumnTransformer

• The main parameter in ColumnTransformer is called transformers, which is a list of tuples (name, transformer, column(s)) specifying the transformer objects to be applied to subsets of the data.

from sklearn.impute import SimpleImputer

# separate the categorical and numerical data
categoricalFeatures = ['Grade', 'Department']
numericalFeatures = ['DegreeYear', 'Age']

# On creating the ColumnTransformer we specify a list of transformer operations
preprocessor = ColumnTransformer( transformers=
    [ ('num', SimpleImputer(strategy='median'), numericalFeatures),
      ('cat1', SimpleImputer(strategy='most_frequent'), categoricalFeatures)] )

transformedData = preprocessor.fit_transform(df)
df = pd.DataFrame(data= transformedData)

print(df)
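The OneHotEncoder imported earlier can be chained after imputation by wrapping both steps in a Pipeline inside the ColumnTransformer. A minimal sketch, reusing the slide's column names (the step names 'impute', 'onehot', 'num' and 'cat' are arbitrary labels):

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({'Grade': ['A', 'C', 'B', 'B'],
                   'Age': [21, 18, np.nan, np.nan],
                   'DegreeYear': [4, 1, 1, np.nan],
                   'Department': ['Computing', 'Biology', 'Computing', np.nan]})

# Impute the categorical columns first, then one-hot encode the result
categoricalPipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), ['DegreeYear', 'Age']),
    ('cat', categoricalPipeline, ['Grade', 'Department'])])

transformedData = preprocessor.fit_transform(df)
# 2 numerical columns + 3 Grade categories + 2 Department categories
print(transformedData.shape)  # (4, 7)
```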
Missing Values - Advice

• Typically, if 50% or more of the values are missing from a feature, then that feature would commonly be removed entirely.

• One alternative to deleting a feature that suffers from such a large number of missing values is to create a new separate binary feature that indicates whether or not there is a missing value in the original column.

• This approach can be useful if the reason for the original missing value has some relationship to the target variable.

• For example, if the missing feature may contain some sensitive personal information, a person's willingness to provide this data may tell us something about that person that might be related to the final class.

• Typically the binary feature replaces the original feature.
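The binary indicator described above can be built directly in pandas; a minimal sketch (the column name 'Income' is a made-up example):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Income': [52000.0, np.nan, 61000.0, np.nan]})

# 1 where the original value was missing, 0 otherwise
df['Income_missing'] = df['Income'].isnull().astype(int)

# Replace the original feature with the indicator, as suggested above
df = df.drop(['Income'], axis=1)

print(df['Income_missing'].tolist())  # [0, 1, 0, 1]
```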


06/02/2025 18
Missing Values - Advice

• As we mentioned, deleting instances (rows) that contain one or more missing values can also result in a very large amount of data being lost. Unless you have an abundance of data and a relatively small number of missing values, I would not advise employing brute-force deletion of rows.

• Imputation is a useful technique that we can use, but it is not advisable when features have a very large number of missing values.

• A common rule is that imputation is not advisable on a feature that has 30% or more missing values and should never be used on a feature that has 50% or more missing values.

• There are a broad range of approaches to imputation. For example, we can build a predictive model that estimates a replacement for a missing value based on the feature values that are present in the dataset.
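One way to realise this model-based approach is scikit-learn's experimental IterativeImputer, which regresses each feature on the others and predicts the missing entries. A sketch on a made-up pair of correlated features (the tiny array below is illustrative only):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # required to use IterativeImputer
from sklearn.impute import IterativeImputer

# Two correlated features (second is roughly 2x the first); one value is missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

# Each feature is modelled as a function of the others and the
# missing entry is filled with the fitted model's prediction
imputer = IterativeImputer(random_state=0)
Xfilled = imputer.fit_transform(X)

print(Xfilled[2, 1])  # close to 6.0, following the y = 2x relationship
```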
