How to deal with missing values in a Pandas DataFrame?

This recipe helps you deal with missing values in a Pandas DataFrame
Last Updated: 23 Jun 2022

Recipe Objective

Datasets very often contain missing values, and we cannot feed those missing values to models. So how do we deal with them?

So this is the recipe on how we can deal with missing values in a Pandas DataFrame.

Step 1 - Import the library

import pandas as pd
import numpy as np

We have imported numpy and pandas which will be needed for the dataset.

Step 2 - Setting up the Data

We have created a dataframe with different features like "first_name", "last_name", "age", "Comedy_Score" and "Rating_Score". Some of the values are np.nan, which pandas treats as missing.

raw_data = {"first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
            "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
            "age": [42, 38, np.nan, 41, 35],
            "Comedy_Score": [9, 7, np.nan, 8, 5],
            "Rating_Score": [25, 25, 49, np.nan, 70]}
df = pd.DataFrame(raw_data, columns = ["first_name", "last_name", "age", "Comedy_Score", "Rating_Score"])
print(df)

Step 3 - Dealing with missing values

Here we will be using different methods to deal with missing values.
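
Before choosing a method, it can help to see where the missing values are. The sketch below is not part of the original recipe; it rebuilds the data from Step 2 and counts missing values with isna().sum(). The names missing_per_column and missing_per_row are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as in Step 2 of this recipe.
raw_data = {"first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
            "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
            "age": [42, 38, np.nan, 41, 35],
            "Comedy_Score": [9, 7, np.nan, 8, 5],
            "Rating_Score": [25, 25, 49, np.nan, 70]}
df = pd.DataFrame(raw_data)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
# Number of missing values in each row.
missing_per_row = df.isna().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```

Here "age", "Comedy_Score" and "Rating_Score" each have one missing value, and the missing values sit in rows 2 and 3.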

Dropping missing observations

df_no_missing = df.dropna()
print(df_no_missing)

Dropping rows where all cells in that row are NA

df_cleaned = df.dropna(how="all")
print(df_cleaned)
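
In the recipe's data no row is entirely missing, so how="all" leaves the dataframe unchanged. To make the difference from the default how="any" visible, here is a small sketch on made-up data; the dataframe toy and its columns "a" and "b" are hypothetical and only serve this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is partly missing, row 2 is entirely missing.
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                    "b": [4.0, 5.0, np.nan]})

# Default how="any": a row is dropped if it has at least one missing value.
dropped_any = toy.dropna()
# how="all": a row is dropped only if every value in it is missing.
dropped_all = toy.dropna(how="all")

print(dropped_any)
print(dropped_all)
```

With how="any" only row 0 survives, while how="all" keeps rows 0 and 1 and removes only the fully empty row 2.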

Filling missing values with the next valid value in each column (backward fill)

df3 = df.bfill()
print(df3)
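
As a rough illustration of what bfill() does, the sketch below compares it with its counterpart ffill() on a short, made-up Series. The Series s is hypothetical and not part of the recipe's data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap in the middle and a gap at the end.
s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Backward fill: each gap takes the next valid value below it.
back = s.bfill()
# Forward fill: each gap takes the last valid value above it.
fwd = s.ffill()

print(back)
print(fwd)
```

Note that bfill() cannot fill the last gap, because there is no later value to copy from, so that entry stays NaN.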

Creating a new column full of missing values

df["location"] = np.nan
print(df)

Dropping columns if they contain only missing values

print(df.dropna(axis=1, how="all"))

Dropping rows that contain fewer than five non-missing values

print(df.dropna(thresh=5))
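
The thresh argument is the minimum number of non-missing values a row needs in order to be kept. A minimal sketch on made-up data, where the dataframe toy and its columns are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 2 have 3, 2 and 1 non-missing values.
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                    "b": [2.0, 2.0, np.nan],
                    "c": [3.0, 3.0, 3.0]})

# Keep rows with at least two non-missing values.
kept_two = toy.dropna(thresh=2)
# Keep rows with at least three non-missing values.
kept_three = toy.dropna(thresh=3)

print(kept_two)
print(kept_three)
```

With thresh=2 rows 0 and 1 are kept; with thresh=3 only row 0 is kept.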

Filling in missing data with zeros

print(df.fillna(0))

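Filling in missing values in Comedy_Score with the mean value of Comedy_Score

# Assign the result back to the column instead of using inplace=True on a column selection
df["Comedy_Score"] = df["Comedy_Score"].fillna(df["Comedy_Score"].mean())
print(df)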
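
An alternative way to fill a single column is to pass a dictionary to DataFrame.fillna, which returns a new dataframe and leaves the original untouched. This is a sketch, not the recipe's own method; the dataframe scores reuses only the two score columns from Step 2, and the names scores and filled are chosen here.

```python
import numpy as np
import pandas as pd

# The two score columns from Step 2.
scores = pd.DataFrame({"Comedy_Score": [9, 7, np.nan, 8, 5],
                       "Rating_Score": [25, 25, 49, np.nan, 70]})

# Fill only Comedy_Score with its mean; other columns are not touched.
filled = scores.fillna({"Comedy_Score": scores["Comedy_Score"].mean()})

print(filled)
```

The missing Comedy_Score becomes 7.25, the mean of 9, 7, 8 and 5, while the missing Rating_Score stays NaN.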

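Filling in missing values in Comedy_Score with each age’s mean value of Comedy_Score

# Assign the result back to the column instead of using inplace=True on a column selection
df["Comedy_Score"] = df["Comedy_Score"].fillna(df.groupby("age")["Comedy_Score"].transform("mean"))
print(df)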
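
Group-wise filling is easier to see when several rows share the same group. The sketch below uses made-up data with a hypothetical "group" column and "score" column; it follows the same groupby / transform("mean") pattern as the line above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each group has one known and one missing score.
toy = pd.DataFrame({"group": ["a", "a", "b", "b"],
                    "score": [1.0, np.nan, 10.0, np.nan]})

# Mean score of each row's own group, aligned to the original rows.
group_mean = toy.groupby("group")["score"].transform("mean")
# Fill each missing score with the mean of its group.
toy["score"] = toy["score"].fillna(group_mean)

print(toy)
```

Each missing score is replaced by the mean of the known scores in the same group, so group "a" is filled with 1.0 and group "b" with 10.0.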

Selecting the rows of df where age is not NaN and Rating_Score is not NaN

print(df[df["age"].notnull() & df["Rating_Score"].notnull()])
print(df[df["age"].notnull() & df["Rating_Score"].notnull()].fillna(0))
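
The same selection can also be written with dropna and its subset argument, which only looks at the listed columns when deciding which rows to drop. A minimal sketch using the data from Step 2; the names by_mask and by_subset are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same data as in Step 2 of this recipe.
raw_data = {"first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
            "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
            "age": [42, 38, np.nan, 41, 35],
            "Comedy_Score": [9, 7, np.nan, 8, 5],
            "Rating_Score": [25, 25, 49, np.nan, 70]}
df = pd.DataFrame(raw_data)

# Boolean mask: keep rows where both columns are present.
by_mask = df[df["age"].notna() & df["Rating_Score"].notna()]
# Equivalent: drop rows with a missing value in either of the two columns.
by_subset = df.dropna(subset=["age", "Rating_Score"])

print(by_subset)
```

Both versions keep rows 0, 1 and 4.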

So the output comes as:

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
2    Leonard    Hofstadter   NaN           NaN          49.0       NaN
3     Howard      Wolowitz  41.0           8.0           NaN       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
2    Leonard    Hofstadter   0.0           0.0          49.0       0.0
3     Howard      Wolowitz  41.0           8.0           0.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0
