Dealing With Missing Values

The document discusses strategies for handling missing values in datasets, emphasizing the importance of addressing these gaps before using machine learning algorithms. It covers techniques such as identifying missing values, replacing them with NaN, removing rows or columns with missing data, and using imputation methods like mean or median to estimate missing values. Additionally, it introduces the ColumnTransformer for applying different transformations to numerical and categorical data separately.

Dealing with Missing Values

• It is not uncommon in real-world applications that some samples are missing one or more values for various reasons. There could be an error in the data collection process, certain fields may have been left blank in a survey, there may have been a power outage, etc.

• Most ML algorithms are unable to process these missing values, so it is important that we deal with all missing values within the data before sending it to the ML algorithm.
Dealing with Missing Values

• Notice the .sum() method will count the number of missing values in each column.

import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
seriesB = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
seriesC = pd.Series(np.random.rand(3), index=['b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})
print(df)
print(df.isnull().sum())

Output:

        one     three       two
a  0.376060       NaN  0.111221
b  0.400020  0.164076  0.548106
c  0.507972  0.325337  0.137571
d       NaN  0.823270  0.816618

one      1
three    1
two      0
dtype: int64
Dealing with Missing Values

• Please note that the call to df.isnull().sum() will only identify the missing values that are specified as NaN.

• Missing values can often be represented using values such as '?', -1, -99 or -999. Missing values such as these will not show up using isnull(). You should be able to identify these in your data exploration stage or when looking for outliers.

• A simple way of dealing with this is to replace the non-standard missing values with NaN.
Dealing with Missing Values

import pandas as pd
import numpy as np

seriesA = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesB = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesC = pd.Series([8, 1, '?', 43], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

print(df)
print(df.isnull().sum())

Output:

  one three two
a  12     8  12
b   4     1   4
c   3     ?   3
d   ?    43   ?

one      0
three    0
two      0
dtype: int64
Dealing with Missing Values

In this case we use replace to replace any occurrence of '?' within the dataframe with NaN.

import pandas as pd
import numpy as np

seriesA = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesB = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesC = pd.Series([8, 1, '?', 43], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

df = df.replace('?', np.NaN)

print(df)
print(df.isnull().sum())

Output:

   one three  two
a   12     8   12
b    4     1    4
c    3   NaN    3
d  NaN    43  NaN

one      1
three    1
two      1
dtype: int64
Dealing with Missing Values

In this case we use replace to replace any occurrence of '?' with NaN within one column of our dataframe (in this case column 'one').

import pandas as pd
import numpy as np

seriesA = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesB = pd.Series([12, 4, 3, '?'], index=['a', 'b', 'c', 'd'])
seriesC = pd.Series([8, 1, '?', 43], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

df['one'] = df['one'].replace('?', np.NaN)

print(df)
print(df.isnull().sum())

Output:

   one three two
a   12     8  12
b    4     1   4
c    3     ?   3
d  NaN    43   ?

one      1
three    0
two      0
dtype: int64
Dealing with Missing Values

• One of the easiest ways to deal with missing values is to simply remove the corresponding features (columns) or rows from the dataset entirely.

• Rows
  • df.dropna() will remove any rows that contain a missing value.
  • df.dropna(thresh=3) - the parameter thresh specifies the number of non-NA values that a row must have in order to be retained.
  • df.dropna(subset=['A']) will only drop rows where missing values appear in a specific column, in this case column A.

• Columns
  • df.dropna(axis=1) will drop columns that have at least one missing value.
  • If you want to drop a column of a specific name you can call df.drop(['ColumnName'], axis=1).

• Each of the above will return a new DataFrame!
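The subset and column-wise variants listed above can be sketched on a small made-up frame (the column names A, B, C are illustrative, not from the slides):

```python
import pandas as pd
import numpy as np

# Small frame with missing values in columns 'A' and 'B'
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

rows_a = df.dropna(subset=['A'])    # drop rows only where 'A' is missing
no_na_cols = df.dropna(axis=1)      # drop every column containing a NaN
dropped = df.drop(['C'], axis=1)    # drop column 'C' by name

print(rows_a.index.tolist())        # [0, 2]
print(no_na_cols.columns.tolist())  # ['C']
print(dropped.columns.tolist())     # ['A', 'B']
```

Note that df itself is unchanged after all three calls, confirming that each returns a new DataFrame.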


import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
seriesB = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
seriesC = pd.Series(np.random.rand(3), index=['b', 'c', 'd'])

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC})

print(df)

newdf = df.dropna()
print(newdf)

Here we drop any rows from our dataframe that contain a missing value.

Output:

        one     three       two
a  0.867059       NaN  0.255192
b  0.722719  0.420534  0.212348
c  0.328197  0.141678  0.237098
d       NaN  0.458063  0.503182

        one     three       two
b  0.722719  0.420534  0.212348
c  0.328197  0.141678  0.237098
import pandas as pd
import numpy as np

seriesA = pd.Series(np.random.rand(3))
seriesB = pd.Series(np.random.rand(4))
seriesC = pd.Series(np.random.rand(5))
seriesD = pd.Series(np.random.rand(7))

df = pd.DataFrame({'one' : seriesA,
                   'two' : seriesB,
                   'three' : seriesC,
                   'four' : seriesD})
print(df)
df = df.dropna(thresh=4, axis=1)
print(df)

Notice above the threshold value is 4. Therefore, a column must have four or more non-NA values in order to be retained.

Output:

       four       one     three       two
0  0.766476  0.909878  0.476737  0.084872
1  0.810370  0.285238  0.386073  0.268438
2  0.567595  0.162616  0.213230  0.389272
3  0.958997       NaN  0.579480  0.689228
4  0.141135       NaN  0.986758       NaN
5  0.612904       NaN       NaN       NaN
6  0.193091       NaN       NaN       NaN

       four     three       two
0  0.766476  0.476737  0.084872
1  0.810370  0.386073  0.268438
2  0.567595  0.213230  0.389272
3  0.958997  0.579480  0.689228
4  0.141135  0.986758       NaN
5  0.612904       NaN       NaN
6  0.193091       NaN       NaN
Dealing with Missing Values

• Although the removal of missing values may seem convenient, it also comes with significant disadvantages.

• If you delete a row that has a missing value then you could potentially be deleting useful feature information.

• In many cases you may not have an abundance of data and may end up removing many samples, which will reduce the accuracy of our model.

• It is typically preferable to use imputation techniques to estimate the missing values from the other training samples in our dataset.
Dealing with Missing Values

• One of the most common imputation techniques is mean imputation, where we replace a missing value with the mean of the data items in that column.

• Scikit-learn provides an easy way of applying this method by using the SimpleImputer class.

• When creating a SimpleImputer object we can specify the strategy used for imputing missing values. The most typical is strategy = 'mean', however you can also impute by specifying strategy = 'median', strategy = 'most_frequent' (mode) or strategy = 'constant'.

• The most_frequent strategy replaces the missing value with the most frequent value in that column. This is a strategy that is useful for categorical features.

• Another feature of the SimpleImputer is that we can specify the missing value using the parameter missing_values.
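Since the later slides only demonstrate the mean strategy, here is a minimal sketch of most_frequent on a categorical column (the 'Grade' values below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# A categorical column in which 'B' is the most frequent value
df = pd.DataFrame({'Grade': ['A', 'B', 'B', np.nan, 'B', 'C']})

# Unlike mean/median, most_frequent also works on string data
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imputer.fit_transform(df[['Grade']])

print(filled.ravel())  # the NaN is replaced by 'B'
```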
from sklearn.impute import SimpleImputer
# note we are using the dataframe we defined at the start of this section

print(df)

imputer = SimpleImputer(missing_values = np.NaN, strategy='mean')

imputer.fit(df)

allValues = imputer.transform(df)

print(allValues)

In this case we can impute for an entire dataset by passing in a pandas dataframe (we could also pass in a NumPy array). The array we pass in must be an exclusively numerical array. The transform operation will always return a NumPy array.

Output:

        one     three       two
a  0.030666       NaN  0.921680
b  0.351147  0.740355  0.478344
c  0.449259  0.299911  0.937952
d       NaN  0.897354  0.168368

[[ 0.03066623  0.64587346  0.92168033]
 [ 0.3511469   0.7403552   0.47834361]
 [ 0.4492592   0.29991101  0.93795242]
 [ 0.27702411  0.89735416  0.16836778]]
from sklearn.impute import SimpleImputer
# note we are using the dataframe we defined at the start of this section

imputer = SimpleImputer(missing_values = np.NaN, strategy='mean')

imputer.fit(df)

allValues = imputer.transform(df)

print(allValues)

df = pd.DataFrame(data = allValues, columns = df.columns)

print(df)

Notice that above we convert the NumPy array back into a Pandas dataframe using pd.DataFrame.

Output:

[[ 0.13159694  0.57903764  0.39386208]
 [ 0.73002787  0.92954245  0.16350405]
 [ 0.67807006  0.58945908  0.60437333]
 [ 0.51323162  0.21811138  0.91419672]]

        one     three       two
0  0.131597  0.579038  0.393862
1  0.730028  0.929542  0.163504
2  0.678070  0.589459  0.604373
3  0.513232  0.218111  0.914197
from sklearn.impute import SimpleImputer
# note we are using the dataframe we defined at the start of this section

print(df)

imputer = SimpleImputer(missing_values = np.NaN, strategy='mean')

imputer.fit( df[['one']] )

df['one'] = imputer.transform( df[['one']] )

print(df)

In the example above we impute for just one specific column within our dataframe.

Output:

        one     three       two
a  0.068504       NaN  0.265894
b  0.323957  0.004214  0.151085
c  0.311528  0.646061  0.912238
d       NaN  0.642776  0.467234

        one     three       two
a  0.068504       NaN  0.265894
b  0.323957  0.004214  0.151085
c  0.311528  0.646061  0.912238
d  0.234663  0.642776  0.467234
Using ColumnTransformer

• ColumnTransformer is a class that allows different columns of the input to be transformed separately, with the results combined into a single feature space. This is useful when dealing with datasets that contain heterogeneous data.

• Let's illustrate using a slightly different dataset. You will notice there is a missing value in the Age column. In this example we are going to use ColumnTransformer to perform different transformations on the numerical data and the categorical data and merge the result.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

seriesA = pd.Series(['A', 'C', 'B', 'B'])
seriesB = pd.Series([21, 18])
seriesC = pd.Series([4, 1, 1])
seriesD = pd.Series(['Computing', 'Biology', 'Computing'])

df = pd.DataFrame({'Grade' : seriesA, 'Age' : seriesB, 'DegreeYear' : seriesC,
                   'Department' : seriesD})
print(df)
Using ColumnTransformer

• The main parameter in ColumnTransformer is called transformers, which is a list of tuples (name, transformer, column(s)) specifying the transformer objects to be applied to subsets of the data.

from sklearn.impute import SimpleImputer

# separate the categorical and numerical data
categoricalFeatures = ['Grade', 'Department']
numericalFeatures = ['DegreeYear', 'Age']

# On creating the ColumnTransformer we specify a list of transformer operations
preprocessor = ColumnTransformer( transformers=
    [ ('num', SimpleImputer(strategy='median'), numericalFeatures),
      ('cat1', SimpleImputer(strategy='most_frequent'), categoricalFeatures)] )

transformedData = preprocessor.fit_transform(df)
df = pd.DataFrame(data= transformedData)

print(df)
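The OneHotEncoder imported earlier can be chained after imputation by wrapping both steps in a Pipeline inside the ColumnTransformer. A minimal sketch, reusing the slide's column names (the step names 'impute', 'onehot', 'num' and 'cat' are arbitrary labels):

```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({'Grade': ['A', 'C', 'B', 'B'],
                   'Age': [21, 18, np.nan, np.nan],
                   'DegreeYear': [4, 1, 1, np.nan],
                   'Department': ['Computing', 'Biology', 'Computing', np.nan]})

# Impute the categorical columns first, then one-hot encode the result
categoricalPipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), ['DegreeYear', 'Age']),
    ('cat', categoricalPipeline, ['Grade', 'Department'])])

transformedData = preprocessor.fit_transform(df)
# 2 numerical columns + 3 Grade categories + 2 Department categories
print(transformedData.shape)  # (4, 7)
```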
Missing Values - Advice

• Typically, if 50% or more of the values are missing from a feature, then that feature would commonly be removed entirely.

• One alternative to deleting a feature that suffers from such a large number of missing values is to create a new separate binary feature that indicates whether or not there is a missing value in the original column.

• This approach can be useful if the reason for the original missing value has some relationship to the target variable.

• For example, if the missing feature may contain some sensitive personal information, a person's willingness to provide this data may tell us something about that person that might be related to the final class.

• Typically the binary feature replaces the original feature.
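The binary indicator described above can be built directly in pandas; a minimal sketch (the column name 'Income' is a made-up example):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Income': [52000.0, np.nan, 61000.0, np.nan]})

# 1 where the original value was missing, 0 otherwise
df['Income_missing'] = df['Income'].isnull().astype(int)

# Replace the original feature with the indicator, as suggested above
df = df.drop(['Income'], axis=1)

print(df['Income_missing'].tolist())  # [0, 1, 0, 1]
```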


06/02/2025 18
Missing Values - Advice

• As we mentioned, deleting instances (rows) that contain one or more missing values can also result in a very large amount of data being lost. Unless you have an abundance of data and a relatively small number of missing values, I would not advise employing brute-force deletion of rows.

• Imputation is a useful technique that we can use, but it is not advisable when features have a very large number of missing values.

• A common rule is that imputation is not advisable on a feature that has 30% or more missing values and should never be used on a feature that has 50% or more missing values.

• There are a broad range of approaches to imputation. For example, we can build a predictive model that estimates a replacement for a missing value based on the feature values that are present in the dataset.
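One way to realise this model-based approach is scikit-learn's experimental IterativeImputer, which regresses each feature on the others and predicts the missing entries. A sketch on a made-up pair of correlated features (the tiny array below is illustrative only):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # required to use IterativeImputer
from sklearn.impute import IterativeImputer

# Two correlated features (second is roughly 2x the first); one value is missing
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

# Each feature is modelled as a function of the others and the
# missing entry is filled with the fitted model's prediction
imputer = IterativeImputer(random_state=0)
Xfilled = imputer.fit_transform(X)

print(Xfilled[2, 1])  # close to 6.0, following the y = 2x relationship
```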
