NumPy | Replace NaN values with average of columns
Last Updated :
09 Feb, 2024
Data visualization is one of the most important steps in machine learning and data analytics, and it depends on data that has been cleaned and arranged first.
Data sets sometimes contain NaN (not a number) values, which are unusable for data visualization.
One way to solve this problem is to replace each NaN value with the average of its column.
Given below are a few methods to do this.
- Using np.nanmean and np.take
- Using np.ma and np.where
- Using a naive loop and zip
- Using list comprehension and built-in functions
- Using zip()+lambda()
Let us understand them better with Python program examples:
Using np.nanmean and np.take
We use the nanmean() function of the NumPy library to find the mean of each column while ignoring NaN values. We then use the take() function to pick the column mean (average) for each NaN position and write it in place of the NaN value.
Example:
Python3
# Python code to demonstrate
# to replace nan values
# with an average of columns
import numpy as np
# Initialising numpy array
ini_array = np.array([[1.3, 2.5, 3.6, np.nan],
[2.6, 3.3, np.nan, 5.5],
[2.1, 3.2, 5.4, 6.5]])
# printing initial array
print ("initial array", ini_array)
# column mean
col_mean = np.nanmean(ini_array, axis = 0)
# printing column mean
print ("columns mean", str(col_mean))
# find indices where nan value is present
inds = np.where(np.isnan(ini_array))
# replace inds with avg of column
ini_array[inds] = np.take(col_mean, inds[1])
# printing final array
print ("final array", ini_array)
Output:
initial array [[ 1.3 2.5 3.6 nan]
[ 2.6 3.3 nan 5.5]
[ 2.1 3.2 5.4 6.5]]
columns mean [ 2. 3. 4.5 6. ]
final array [[ 1.3 2.5 3.6 6. ]
[ 2.6 3.3 4.5 5.5]
[ 2.1 3.2 5.4 6.5]]
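The same steps can be wrapped in a small helper that works on a copy, so the original array keeps its NaN values. This is only a sketch: the function name fill_nan_with_col_mean is made up for illustration and is not part of NumPy.

```python
import numpy as np

def fill_nan_with_col_mean(arr):
    """Return a copy of a 2-D array with each NaN replaced by its column mean."""
    out = np.array(arr, dtype=float)           # copy, so the input is not modified
    col_mean = np.nanmean(out, axis=0)         # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(out))       # positions of the NaN entries
    out[rows, cols] = np.take(col_mean, cols)  # pick the mean of each NaN's column
    return out

data = np.array([[1.3, 2.5, 3.6, np.nan],
                 [2.6, 3.3, np.nan, 5.5],
                 [2.1, 3.2, 5.4, 6.5]])
filled = fill_nan_with_col_mean(data)
print(filled)
```

Note that np.nanmean() emits a RuntimeWarning and returns NaN for a column that contains only NaN values, so such a column would stay NaN in the result.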
Using np.ma and np.where
We use the np.ma module to create a masked array in which the NaN values are masked out, so that mean(axis = 0) computes the column averages from the remaining values only. We then use np.where() to replace the NaN values with those column averages.
Example:
Python3
# Python code to demonstrate
# to replace nan values
# with average of columns
import numpy as np
# Initialising numpy array
ini_array = np.array([[1.3, 2.5, 3.6, np.nan],
[2.6, 3.3, np.nan, 5.5],
[2.1, 3.2, 5.4, 6.5]])
# printing initial array
print ("initial array", ini_array)
# replace nan with col means
res = np.where(np.isnan(ini_array), np.ma.array(ini_array,
mask = np.isnan(ini_array)).mean(axis = 0), ini_array)
# printing final array
print ("final array", res)
Output:
initial array [[ 1.3 2.5 3.6 nan]
[ 2.6 3.3 nan 5.5]
[ 2.1 3.2 5.4 6.5]]
final array [[ 1.3 2.5 3.6 6. ]
[ 2.6 3.3 4.5 5.5]
[ 2.1 3.2 5.4 6.5]]
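A slightly more explicit variant of the masked-array approach is sketched below. It assumes np.ma.masked_invalid() is acceptable for building the mask; note that this function masks infinite values as well as NaN, which differs from the np.isnan() mask used above.

```python
import numpy as np

data = np.array([[1.3, 2.5, 3.6, np.nan],
                 [2.6, 3.3, np.nan, 5.5],
                 [2.1, 3.2, 5.4, 6.5]])

masked = np.ma.masked_invalid(data)        # masks NaN (and inf) entries
mask = np.ma.getmaskarray(masked)          # always a full boolean array

# Column means over the unmasked values, converted to a plain ndarray.
# A column that is entirely masked would yield NaN here.
col_mean = masked.mean(axis=0).filled(np.nan)

# Broadcast the column means across rows and use them where the mask is set
filled = np.where(mask, col_mean, data)
print(filled)
```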
Using a naive loop and zip
We first use np.where() to find the row and column indices of the NaN values. zip(*indices) then pairs up the elements of the two index arrays, giving a (row, column) pair for each NaN value in the array. We then replace each of these values with the average of the non-NaN values in its column.
Example:
Python3
# Python code to demonstrate
# to replace nan values
# with average of columns
import numpy as np
# Initialising numpy array
ini_array = np.array([[1.3, 2.5, 3.6, np.nan],
[2.6, 3.3, np.nan, 5.5],
[2.1, 3.2, 5.4, 6.5]])
# printing initial array
print ("initial array", ini_array)
# indices where values is nan in array
indices = np.where(np.isnan(ini_array))
# Iterating over numpy array to replace nan with values
for row, col in zip(*indices):
    ini_array[row, col] = np.mean(ini_array[
        ~np.isnan(ini_array[:, col]), col])
# printing final array
print ("final array", ini_array)
Output:
initial array [[ 1.3 2.5 3.6 nan]
[ 2.6 3.3 nan 5.5]
[ 2.1 3.2 5.4 6.5]]
final array [[ 1.3 2.5 3.6 6. ]
[ 2.6 3.3 4.5 5.5]
[ 2.1 3.2 5.4 6.5]]
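The loop above recomputes a column mean for every NaN it visits. As an alternative sketch (not the method shown above), the loop can run over columns instead, so each mean is computed once. The choice to leave all-NaN columns untouched is an assumption made for this example.

```python
import numpy as np

data = np.array([[1.3, 2.5, 3.6, np.nan],
                 [2.6, 3.3, np.nan, 5.5],
                 [2.1, 3.2, 5.4, 6.5]])

filled = data.copy()                     # keep the original array unchanged
for col in range(filled.shape[1]):
    column = filled[:, col]              # a view: writes go into `filled`
    missing = np.isnan(column)
    # Skip columns with nothing to fill, and columns with no valid values
    if missing.any() and not missing.all():
        column[missing] = column[~missing].mean()

print(filled)
```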
Using list comprehension and built-in functions
This method works on a plain Python list of lists in which missing values are represented by None rather than NaN. It first computes the column means using a list comprehension with the help of the filter and zip functions, skipping the None entries. Then, it replaces the None values in each row with the corresponding column means using another list comprehension with the help of the enumerate function. Finally, it returns the modified list.
Algorithm:
1. Compute the column means, ignoring None values.
2. Replace the None values in the list with the corresponding column means using list comprehension and built-in functions.
3. Return the modified list.
Python3
def replace_nan_with_mean(arr):
    # mean of each column, ignoring None entries
    col_means = [sum(filter(lambda x: x is not None, col)) /
                 len(list(filter(lambda x: x is not None, col)))
                 for col in zip(*arr)]
    # replace each None with the mean of its column
    for i in range(len(arr)):
        arr[i] = [col_means[j] if x is None else x for j, x in enumerate(arr[i])]
    return arr

arr = [[1.3, 2.5, 3.6, None],
       [2.6, 3.3, None, 5.5],
       [2.1, 3.2, 5.4, 6.5]]
print(replace_nan_with_mean(arr))
Output:
[[1.3, 2.5, 3.6, 6.0], [2.6, 3.3, 4.5, 5.5], [2.1, 3.2, 5.4, 6.5]]
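The function above only recognises None as a missing value. If the list holds real float NaN values, an `x is None` check will not catch them. One possible way to cover both cases is sketched below; the helper names is_missing and replace_missing_with_mean are hypothetical, and returning a new list instead of modifying the input is a design choice made for this example.

```python
import math

def is_missing(x):
    # Assumption for this sketch: both None and float NaN count as missing
    return x is None or (isinstance(x, float) and math.isnan(x))

def replace_missing_with_mean(rows):
    # Mean of the present values in each column (None if a column is empty)
    col_means = []
    for col in zip(*rows):
        present = [x for x in col if not is_missing(x)]
        col_means.append(sum(present) / len(present) if present else None)
    # Build a new list of rows with the missing entries filled in
    return [[col_means[j] if is_missing(x) else x for j, x in enumerate(row)]
            for row in rows]

rows = [[1.3, 2.5, 3.6, float("nan")],
        [2.6, 3.3, None, 5.5],
        [2.1, 3.2, 5.4, 6.5]]
result = replace_missing_with_mean(rows)
print(result)
```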
Using zip()+lambda()
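As in the previous method, missing values are represented by None in a plain Python list of lists. Compute the column means excluding None values using a loop over the transposed list zip(*arr). Replace None values with column means using map() and lambda functions.
Algorithm: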
1. Initialize an empty list means to store the column means.
2. Loop over the transposed array zip(*arr) to iterate over columns.
3. For each column, filter out None values and compute the mean of the remaining values. If there are no remaining values, set the mean to 0.
4. Append the mean to the means list.
5. Use map() and lambda functions to replace None values with the corresponding column mean in each row of the array arr.
6. Return the modified array arr.
Python3
# initial array
arr = [[1.3, 2.5, 3.6, None],
[2.6, 3.3, None, 5.5],
[2.1, 3.2, 5.4, 6.5]]
# compute column means
means = []
for col in zip(*arr):
    values = [x for x in col if x is not None]
    means.append(sum(values)/len(values) if values else 0)
# replace None values with column means
arr = list(map(lambda row: [means[j] if x is None else x for j, x in enumerate(row)], arr))
# print final array
print(arr)
Output:
[[1.3, 2.5, 3.6, 6.0], [2.6, 3.3, 4.5, 5.5], [2.1, 3.2, 5.4, 6.5]]
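When NumPy is available, a list that uses None for missing values can also be converted to a float array first and then handled with the NumPy functions used earlier. The sketch below relies on np.array(..., dtype=float) turning None into NaN; the variable names are chosen for this example only.

```python
import numpy as np

rows = [[1.3, 2.5, 3.6, None],
        [2.6, 3.3, None, 5.5],
        [2.1, 3.2, 5.4, 6.5]]

data = np.array(rows, dtype=float)       # None becomes nan under dtype=float
col_mean = np.nanmean(data, axis=0)      # column means, ignoring NaN
filled = np.where(np.isnan(data), col_mean, data)

result = filled.tolist()                 # back to a plain list of lists
print(result)
```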