Dealing With Missing Values
It is not uncommon in real-world applications that some samples are missing one or more values for various reasons: there could be an error in the data collection process, certain fields may have been left blank in a survey, a power outage may have interrupted recording, etc.

Most ML algorithms are unable to process these missing values, and as such it is important that we deal with all missing values within the data before sending it to the ML algorithm.
Dealing with Missing Values
• Notice the .sum() method will count the number of missing values in each
column
import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
seriesB = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
seriesC = pd.Series(np.random.rand(3), index=['b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})
print (df)
print (df.isnull().sum())

        one     three       two
a  0.376060       NaN  0.111221
b  0.400020  0.164076  0.548106
c  0.507972  0.325337  0.137571
d       NaN  0.823270  0.816618

one      1
three    1
two      0
dtype: int64
• Please note that the call to df.isnull().sum() will only identify the missing values that are specified as NaN.
• Missing values can often be represented using values such as '?', -1, -99, -999. Missing values such as these will not show up using isnull(). You should be able to identify these in your data exploration stage or when looking for outliers.
import pandas as pd
import numpy as np

seriesA = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesB = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesC = pd.Series([8, 1, '?', 43], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

df['one'] = df['one'].replace('?', np.nan)

print (df)
print (df.isnull().sum())

  one three  two
a  12     8   12
b   4     1    4
c   3     ?    3
d NaN    43    ?

one      1
three    0
two      0
dtype: int64
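Where several different placeholder codes are used for missing data, they can all be converted to NaN in a single call before counting. A minimal sketch, assuming the placeholder codes are '?', -1, -99 and -999 (the codes used in any real dataset must be confirmed during data exploration):

import pandas as pd
import numpy as np

df2 = pd.DataFrame({'one': [12, 4, -99, '?'],
                    'two': [8, -1, 3, 43]})

# map every known placeholder code to NaN in one call, then count the missing values
df2 = df2.replace(['?', -1, -99, -999], np.nan)
print (df2.isnull().sum())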
• One of the easiest ways to deal with missing values is to simply remove the corresponding features (columns) or rows from the dataset entirely (see the short sketch after the code below).
• Rows
  • df.dropna() will remove any rows that contain a missing value.
  • df.dropna(thresh=3): the parameter thresh specifies the minimum number of non-NA values a row must contain in order to be kept (here at least 3).
• Columns
  • df.dropna(axis=1) will drop columns that have at least one missing value.
  • If you want to drop a column of a specific name you can call df.drop(['ColumnName'], axis=1).
df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})
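As a quick illustration of the row and column removal calls listed above, a minimal sketch on a small throwaway dataframe (the values and column names are purely illustrative):

import pandas as pd
import numpy as np

df2 = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                    'two': [4.0, 5.0, 6.0]})

print (df2.dropna())               # removes the row that contains the NaN
print (df2.dropna(axis=1))         # removes the column that contains the NaN
print (df2.drop(['two'], axis=1))  # removes the column named 'two' regardless of NaNs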
import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3))
seriesB = pd.Series(np.random.rand(4))
seriesC = pd.Series(np.random.rand(5))
seriesD = pd.Series(np.random.rand(7))

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC,
                   'four' : seriesD})
print (df)
df = df.dropna(thresh=4, axis=1)
print (df)

       four       one     three       two
0  0.766476  0.909878  0.476737  0.084872
1  0.810370  0.285238  0.386073  0.268438
2  0.567595  0.162616  0.213230  0.389272
3  0.958997       NaN  0.579480  0.689228
4  0.141135       NaN  0.986758       NaN
5  0.612904       NaN       NaN       NaN
6  0.193091       NaN       NaN       NaN

       four     three       two
0  0.766476  0.476737  0.084872
1  0.810370  0.386073  0.268438
2  0.567595  0.213230  0.389272
3  0.958997  0.579480  0.689228
4  0.141135  0.986758       NaN
5  0.612904       NaN       NaN
6  0.193091       NaN       NaN

Notice above the threshold value is 4. Therefore, a column must have four or more non-NA values in order to be retained, which is why the column 'one' (with only three non-NA values) is dropped.
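The thresh parameter works the same way for rows (the default axis). As a brief sketch using the dataframe printed above, each row would now need at least three non-NA values to be kept, so rows 4, 5 and 6 would be removed:

# keep only the rows that contain at least 3 non-NA values
print (df.dropna(thresh=3))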
• Although the removal of missing values may seem convenient, it also comes with significant disadvantages.
  • If you delete a row that has a missing value then you could potentially be deleting useful feature information.
  • In many cases you may not have an abundance of data, and we may end up removing many samples, which will reduce the accuracy of the resulting model.
• An alternative to removal is imputation, where missing values are replaced with estimated values; scikit-learn provides the SimpleImputer class for this.
• When creating a SimpleImputer object we can specify the strategy used for imputing missing values. The most typical is strategy = 'mean', however you can also impute by specifying strategy = 'median', strategy = 'most_frequent' (mode) or strategy = 'constant'.
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# note: df here is again the small dataframe defined at the start of this section
print (df)

# create the imputer; the 'mean' strategy fills each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(df)
allValues = imputer.transform(df)
print (allValues)

# convert the NumPy array returned by transform back into a Pandas dataframe
df = pd.DataFrame(data=allValues, columns=df.columns)
print (df)

[[ 0.13159694  0.57903764  0.39386208]
 [ 0.73002787  0.92954245  0.16350405]
 [ 0.67807006  0.58945908  0.60437333]
 [ 0.51323162  0.21811138  0.91419672]]

        one     three       two
0  0.131597  0.579038  0.393862
1  0.730028  0.929542  0.163504
2  0.678070  0.589459  0.604373
3  0.513232  0.218111  0.914197

Notice that above we convert the NumPy array back into a Pandas dataframe using pd.DataFrame.
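The other strategies mentioned above are created in the same way and then used with fit/transform exactly like the mean imputer; a brief sketch (the constant fill value of 0 is purely illustrative):

from sklearn.impute import SimpleImputer

medianImputer = SimpleImputer(strategy='median')                    # fill with the column median
modeImputer = SimpleImputer(strategy='most_frequent')               # fill with the most frequent value (the mode)
constantImputer = SimpleImputer(strategy='constant', fill_value=0)  # fill with a fixed constant of our choosing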
from sklearn.impute import SimpleImputer
import numpy as np

# note we are using the dataframe we defined at the start of this section
print (df)

# create an imputer and fit it on just the 'one' column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit( df[['one']] )

# replace the missing value in 'one' with the fitted column mean
df[['one']] = imputer.transform( df[['one']] )
print (df)

The dataframe before imputation:

        one     three       two
a  0.068504       NaN  0.265894
b  0.323957  0.004214  0.151085
c  0.311528  0.646061  0.912238
d       NaN  0.642776  0.467234

In the example above we impute for just one specific column within our dataframe: the NaN in column 'one' is filled with the mean of that column, while the NaN in column 'three' is left untouched.
The single-column imputation can also be wrapped in a preprocessor object and applied to the whole dataframe in one step. A minimal sketch, assuming the preprocessor is a scikit-learn ColumnTransformer that imputes 'one' and passes the remaining columns through unchanged (the transformer name 'imputeOne' is illustrative):

from sklearn.compose import ColumnTransformer

# assumed construction of the preprocessor: impute 'one', pass the other columns through
preprocessor = ColumnTransformer([('imputeOne', SimpleImputer(strategy='mean'), ['one'])], remainder='passthrough')
transformedData = preprocessor.fit_transform(df)
df = pd.DataFrame(data= transformedData)
print (df)

Note that pd.DataFrame(data=transformedData) gives the new dataframe numbered column labels; the original names can be restored by passing a columns argument.
Dealing with Missing Values - Advice

• Typically, if a feature has 50% or more missing values, then that feature would commonly be removed entirely.
• One alternative to deleting a feature that suffers from such a large number of missing values is to create a new separate binary feature that indicates whether there is a missing value in the original column or not (a short sketch follows this list).
• This approach can be useful if the reason for the original missing value has some relationship to the target variable.
• Imputation is a useful technique that we can use, but it is not advisable to rely on it when a very large proportion of a feature's values are missing.
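A minimal sketch of the binary missing-value indicator idea above, assuming a dataframe with NaNs in a column named 'one' (the new column name 'one_missing' is illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, 3.0, np.nan],
                   'two': [4.0, 5.0, 6.0, 7.0]})

# 1 where the original value was missing, 0 otherwise
df['one_missing'] = df['one'].isnull().astype(int)
print (df)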