NumPy - Imputing Missing Data



Imputing Missing Data in Arrays

Imputing missing data in arrays involves filling in the missing values with estimated or calculated values based on the available data. This process helps in the following ways −

  • Preserve Data: Avoids loss of information that might be important for analysis.
  • Improve Analysis: Ensures complete datasets, which can lead to more accurate analyses.
  • Handle Missing Data: Addresses gaps in data that could distort results if left unhandled.

Imputing Missing Data with Mean

Mean imputation fills in the missing values in a dataset by replacing them with the mean (average) of the available data.

The mean, often referred to as the average, is a measure of central tendency that summarizes a set of numbers with a single representative value.

It is calculated by adding together all the numbers in a dataset and then dividing the sum by the count of those numbers.
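As a minimal sketch of that calculation, the snippet below computes the mean by hand on the observed (non-NaN) values and compares it with `np.nanmean`. The sample array and the names `observed` and `manual_mean` are illustrative assumptions.

```python
import numpy as np

# Illustrative sample data; NaN marks a missing entry
arr = np.array([1.0, 2.5, np.nan, 4.7, np.nan, 6.2])

# Keep only the observed (non-NaN) values
observed = arr[~np.isnan(arr)]

# Mean = sum of the observed values / number of observed values
manual_mean = observed.sum() / observed.size

# np.nanmean does the same computation while skipping NaNs
print(np.isclose(manual_mean, np.nanmean(arr)))
```

Note that the count in the denominator is the number of observed values (4 here), not the full length of the array.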

Example

In the following example, we calculate the mean of non-NaN values in an array and then use this mean to replace NaN values −

import numpy as np

# Creating an array with NaN values
arr = np.array([1.0, 2.5, np.nan, 4.7, np.nan, 6.2])

# Calculating the mean of non-NaN values
mean_value = np.nanmean(arr)

# Imputing NaN values with the mean
imputed_arr = np.where(np.isnan(arr), mean_value, arr)

print("Original Array:\n", arr)
print("Mean Value:", mean_value)
print("Imputed Array:\n", imputed_arr)

Following is the output obtained −

Original Array:
 [1.  2.5 nan 4.7 nan 6.2]
Mean Value: 3.5999999999999996
Imputed Array:
 [1.  2.5 3.6 4.7 3.6 6.2]

Imputing Missing Data with Median

Median imputation fills in the missing values in a dataset by replacing them with the median of the available data.

The median is the middle value in a dataset when it is ordered, or the average of the two middle values if the dataset has an even number of observations.
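The definition above can be sketched directly in code. This is an illustrative version, assuming the same sample array used elsewhere on this page; the names `observed` and `manual_median` are hypothetical.

```python
import numpy as np

# Illustrative sample data; NaN marks a missing entry
arr = np.array([1.0, 2.5, np.nan, 4.7, np.nan, 6.2])

# Drop the NaNs and sort the observed values
observed = np.sort(arr[~np.isnan(arr)])
n = observed.size
mid = n // 2

if n % 2 == 1:
    # Odd count: the single middle value
    manual_median = observed[mid]
else:
    # Even count: the average of the two middle values
    manual_median = (observed[mid - 1] + observed[mid]) / 2

# np.nanmedian gives the same result while skipping NaNs
print(np.isclose(manual_median, np.nanmedian(arr)))
```

Here four values are observed, so the median is the average of the two middle values, 2.5 and 4.7.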

Example

In this example, we are calculating the median of non-NaN values in an array and then using this median to replace NaN values −

import numpy as np

# Creating an array with NaN values
arr = np.array([1.0, 2.5, np.nan, 4.7, np.nan, 6.2])

# Calculating the median of non-NaN values
median_value = np.nanmedian(arr)

# Imputing NaN values with the median
imputed_arr = np.where(np.isnan(arr), median_value, arr)

print("Original Array:\n", arr)
print("Median Value:", median_value)
print("Imputed Array:\n", imputed_arr)

This will produce the following result −

Original Array:
 [1.  2.5 nan 4.7 nan 6.2]
Median Value: 3.6
Imputed Array:
 [1.  2.5 3.6 4.7 3.6 6.2]

Imputing Missing Data with a Constant

Constant imputation fills in the missing values in a dataset by replacing them with a predefined constant value.

The constant is a fixed value chosen in advance, such as 0, and the same value is used for every missing entry.

Example

In the example below, we define a constant value for imputation and replace NaN values in an array with this constant −

import numpy as np

# Creating an array with NaN values
arr = np.array([1.0, 2.5, np.nan, 4.7, np.nan, 6.2])

# Defining a constant value for imputation
constant_value = 0

# Imputing NaN values with the constant
imputed_arr = np.where(np.isnan(arr), constant_value, arr)

print("Original Array:\n", arr)
print("Constant Value:", constant_value)
print("Imputed Array:\n", imputed_arr)

Following is the output of the above code −

Original Array:
 [1.  2.5 nan 4.7 nan 6.2]
Constant Value: 0
Imputed Array:
 [1.  2.5 0.  4.7 0.  6.2]
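As an alternative sketch, `np.nan_to_num` can perform the same constant fill through its `nan` keyword (available in NumPy 1.17 and later). The fill value `-1.0` below is an arbitrary choice made for illustration.

```python
import numpy as np

# Illustrative sample data; NaN marks a missing entry
arr = np.array([1.0, 2.5, np.nan, 4.7, np.nan, 6.2])

# Arbitrary constant chosen for this sketch
fill_value = -1.0

# Replace every NaN with the constant; returns a new array by default
filled = np.nan_to_num(arr, nan=fill_value)

# Same result using the np.where pattern
same = np.where(np.isnan(arr), fill_value, arr)

print(np.array_equal(filled, same))
```

One difference to keep in mind: `np.nan_to_num` also replaces positive and negative infinity with large finite numbers unless `posinf` and `neginf` are set, whereas the `np.where` version touches only NaN entries.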

Imputing Missing Data in Multi-dimensional Arrays

The same techniques apply to arrays with more than one dimension, such as 2D matrices or higher-dimensional arrays. Here the fill value can be computed along a chosen axis, for example one mean per column, instead of one value for the whole array.

Example: Imputing Missing Data in a 2D Array

In the following example, we calculate the mean of each column in a 2D array while ignoring NaN values. We then replace the NaN values with the mean of their respective columns −

import numpy as np

# Creating a 2D array with NaN values
arr_2d = np.array([[1.0, np.nan, 3.5],
                   [np.nan, 5.1, 6.3],
                   [7.2, 8.1, np.nan]])

# Calculating the mean of each column, ignoring NaN values
column_means = np.nanmean(arr_2d, axis=0)

# Locating the (row, column) indices of the NaN values
inds = np.where(np.isnan(arr_2d))

# Working on a copy so the original array stays unchanged
imputed_2d = arr_2d.copy()

# Replacing each NaN with the mean of its column
imputed_2d[inds] = np.take(column_means, inds[1])

print("Original 2D Array:\n", arr_2d)
print("Column Means:", column_means)
print("Imputed 2D Array:\n", imputed_2d)

The output obtained is as shown below −

Original 2D Array:
 [[1.  nan 3.5]
 [nan 5.1 6.3]
 [7.2 8.1 nan]]
Column Means: [4.1 6.6 4.9]
Imputed 2D Array:
 [[1.  6.6 3.5]
 [4.1 5.1 6.3]
 [7.2 8.1 4.9]]
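The column-wise approach can be adapted to rows. The following is a sketch, not the only way to do it: it computes one mean per row with `keepdims=True` and relies on broadcasting inside `np.where`. The names `row_means` and `imputed` are illustrative.

```python
import numpy as np

# Same 2D sample data as in the column-wise example
arr_2d = np.array([[1.0, np.nan, 3.5],
                   [np.nan, 5.1, 6.3],
                   [7.2, 8.1, np.nan]])

# One mean per row, ignoring NaNs; keepdims=True keeps shape (3, 1)
row_means = np.nanmean(arr_2d, axis=1, keepdims=True)

# Broadcasting stretches each row mean across its row;
# np.where returns a new array and leaves arr_2d untouched
imputed = np.where(np.isnan(arr_2d), row_means, arr_2d)

print(imputed)
```

Whether row means or column means are appropriate depends on what the rows and columns represent in the data.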

Example: Imputing Missing Data in a 3D Array

Here, we calculate one median for each column (each position along the last axis) of a 3D array, pooling the values from every slice and row while ignoring NaNs. We then replace each NaN value with the median of its column −

import numpy as np

# Creating a 3D array with some NaN values
arr_3d = np.array([[[1.0, 2.0, np.nan],
                    [np.nan, 5.0, 6.0],
                    [7.0, np.nan, 9.0]],
                   [[np.nan, 2.0, 3.0],
                    [4.0, np.nan, np.nan],
                    [7.0, 8.0, np.nan]]])

# Calculating one median per column (last-axis index), pooled over
# both slices and all rows, ignoring NaN values
median_value = np.nanmedian(arr_3d, axis=(0, 1))

# Building a boolean mask that marks the NaN positions
nan_indices = np.isnan(arr_3d)

# Replacing the NaN values in each column with that column's median
for i in range(arr_3d.shape[2]):  # Iterate over the last axis
    arr_3d[:, :, i][nan_indices[:, :, i]] = median_value[i]

print("3D array after median imputation:")
print(arr_3d)

After executing the above code, we get the following output −

3D array after median imputation:
[[[1.  2.  6. ]
  [5.5 5.  6. ]
  [7.  3.5 9. ]]

 [[5.5 2.  3. ]
  [4.  3.5 6. ]
  [7.  8.  6. ]]]
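The loop over the last axis can also be written without an explicit loop. The sketch below assumes the same sample array and uses broadcasting: the medians have shape `(3,)`, which lines up with the last axis of the `(2, 3, 3)` array. The names `medians` and `imputed` are illustrative.

```python
import numpy as np

# Same 3D sample data as in the loop-based example
arr_3d = np.array([[[1.0, 2.0, np.nan],
                    [np.nan, 5.0, 6.0],
                    [7.0, np.nan, 9.0]],
                   [[np.nan, 2.0, 3.0],
                    [4.0, np.nan, np.nan],
                    [7.0, 8.0, np.nan]]])

# One median per last-axis index, pooled over axes 0 and 1
medians = np.nanmedian(arr_3d, axis=(0, 1))

# The (3,) medians broadcast against the last axis of the (2, 3, 3) array
imputed = np.where(np.isnan(arr_3d), medians, arr_3d)

print(imputed)
```

Unlike the loop version, this returns a new array and leaves `arr_3d` unchanged.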

Imputing with Linear Interpolation

Imputing missing data using linear interpolation involves estimating the missing values based on the values that surround them. This technique is useful for data that is sequential or spatial, where the missing values can be inferred by the values that precede and follow them.

  • Linear interpolation is a method of estimating unknown values that fall between known values.
  • In one-dimensional data, it involves drawing a straight line between two known points and using this line to estimate the value at a point in between.
  • For multi-dimensional data, linear interpolation can extend this concept to higher dimensions.
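For the one-dimensional case, the straight-line estimate can be written out explicitly. The sketch below uses two made-up known points, `(x0, y0)` and `(x1, y1)`, and checks the hand-computed value against `np.interp`.

```python
import numpy as np

# Two known points (illustrative values)
x0, y0 = 0.0, 1.0
x1, y1 = 2.0, 3.5

# Position at which the value is missing
x = 1.0

# Straight line through the two points:
# y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
print(y)  # 2.25

# np.interp evaluates the same piecewise-linear function
print(np.interp(x, [x0, x1], [y0, y1]))  # 2.25
```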

Example

In the example below, we use linear interpolation to fill missing values (NaNs) in a 1D array. We achieve this by estimating NaN values based on the surrounding non-NaN values −

import numpy as np

# Creating an array with NaN values
arr = np.array([1.0, np.nan, 3.5, np.nan, 5.0])
print("Original Array:\n", arr)

# Marking the NaN positions and defining a helper that returns
# the indices where a boolean mask is True
nans, x = np.isnan(arr), lambda z: z.nonzero()[0]

# Interpolating the missing values from the neighbouring known values
arr[nans] = np.interp(x(nans), x(~nans), arr[~nans])
print("Array with Interpolated Values:\n", arr)

We get the output as shown below −

Original Array:
 [1.  nan 3.5 nan 5. ]
Array with Interpolated Values:
 [1.   2.25 3.5  4.25 5.  ]
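One edge case worth knowing: when a NaN sits before the first or after the last known value, there is no pair of neighbours to draw a line between. By default, `np.interp` fills such positions with the nearest known end value. The array below is made up to illustrate this, and the names `idx`, `missing`, and `filled` are illustrative.

```python
import numpy as np

# Illustrative data with NaNs at both ends
arr = np.array([np.nan, 2.0, np.nan, 4.0, np.nan])

idx = np.arange(arr.size)   # positions 0..4
missing = np.isnan(arr)     # True where a value is missing

# Interpolate on a copy; positions outside the known range
# receive the first or last known value
filled = arr.copy()
filled[missing] = np.interp(idx[missing], idx[~missing], arr[~missing])

print(filled)  # [2. 2. 3. 4. 4.]
```

If holding the edge value constant is not appropriate for the data, the `left` and `right` parameters of `np.interp` can supply different fill values.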