How to deal with missing values in a Pandas DataFrame?

This recipe helps you deal with missing values in a Pandas DataFrame

Recipe Objective

In a dataset, it is very common to find missing values, and many models cannot work with them directly. So how do we deal with missing values?

This recipe shows how we can deal with missing values in a Pandas DataFrame.

Step 1 - Import the library

import pandas as pd
import numpy as np

We have imported numpy and pandas which will be needed for the dataset.

Step 2 - Setting up the Data

We have created a dataframe with different features: "first_name", "last_name", "age", "Comedy_Score" and "Rating_Score". The "age", "Comedy_Score" and "Rating_Score" columns each contain one missing value (np.nan).

raw_data = {
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "last_name": ["Copper", "Koothrappali", "Hofstadter", "Wolowitz", "Fowler"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "age", "Comedy_Score", "Rating_Score"])
print(df)
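Before choosing a method, it can help to see where the missing values are. The following is a minimal sketch, assuming a shortened copy of the dataframe above; the variable names `missing_per_column` and `rows_with_missing` are only illustrative and are not part of the original recipe.

```python
import numpy as np
import pandas as pd

# Shortened copy of the recipe's data (last_name left out for brevity)
df = pd.DataFrame({
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep only the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here each numeric column has one missing value, and the affected rows are those of Leonard and Howard.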

Step 3 - Dealing with missing values

Here we will be using different methods to deal with missing values.

    • Dropping observations (rows) that contain any missing value

df_no_missing = df.dropna()
print(df_no_missing)

    • Dropping rows where all cells in that row are NA

df_cleaned = df.dropna(how="all")
print(df_cleaned)

    • Filling missing values backward, using the next valid value in each column

df3 = df.bfill()
print(df3)
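To make the direction of the fill clearer, here is a small sketch that compares `bfill()` with its counterpart `ffill()` on two columns of the recipe's data. The names `scores`, `backfilled` and `forwardfilled` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Two numeric columns taken from the recipe's data
scores = pd.DataFrame({
    "age": [42, 38, np.nan, 41, 35],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# bfill: each NaN takes the next valid value below it in the same column
backfilled = scores.bfill()

# ffill: each NaN takes the last valid value above it in the same column
forwardfilled = scores.ffill()

print(backfilled)
print(forwardfilled)
```

With a backward fill the missing age becomes 41.0 and the missing Rating_Score becomes 70.0; with a forward fill they become 38.0 and 49.0. Neither call modifies `scores` itself.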

    • Creating a new column full of missing values

df["location"] = np.nan
print(df)

    • Dropping columns if they only contain missing values

print(df.dropna(axis=1, how="all"))

    • Dropping rows that contain fewer than five non-missing values

print(df.dropna(thresh=5))
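The `thresh` argument counts non-missing values, not missing ones. The sketch below shows this on a small made-up dataframe; the column names `a`, `b`, `c` and the variable `kept` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has 3 valid values, row 1 has 2, row 2 has 1
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, 3, 3],
})

# Keep only the rows that have at least 2 non-missing values
kept = toy.dropna(thresh=2)
print(kept)
```

Rows 0 and 1 are kept, and row 2 is dropped because it has only one non-missing value.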

    • Filling in missing data with zeros

print(df.fillna(0))
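Zero is not always a sensible replacement for every column. As a possible variation, `fillna` also accepts a dictionary that maps column names to fill values. This is only a sketch: the choice of median, mean and zero below is an assumption for illustration, not a recommendation from the recipe.

```python
import numpy as np
import pandas as pd

# Numeric columns from the recipe's data
df = pd.DataFrame({
    "age": [42, 38, np.nan, 41, 35],
    "Comedy_Score": [9, 7, np.nan, 8, 5],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# One fill value per column (illustrative choices)
fill_values = {
    "age": df["age"].median(),
    "Comedy_Score": df["Comedy_Score"].mean(),
    "Rating_Score": 0,
}

# fillna returns a new dataframe; df itself is left unchanged
filled = df.fillna(fill_values)
print(filled)
```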

    • Filling in missing values in Comedy_Score with the mean value of Comedy_Score

df["Comedy_Score"] = df["Comedy_Score"].fillna(df["Comedy_Score"].mean())
print(df)

    • Filling in missing values in Comedy_Score with the mean Comedy_Score of each age group

df["Comedy_Score"] = df["Comedy_Score"].fillna(df.groupby("age")["Comedy_Score"].transform("mean"))
print(df)
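In the sample data every age is unique, so a group-wise fill has little to work with. The sketch below shows the same technique on a grouping column with repeated values. The `department` column and its values are hypothetical and do not come from the recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a grouping column that has repeated values
df = pd.DataFrame({
    "department": ["physics", "physics", "physics", "engineering", "engineering"],
    "Comedy_Score": [9, np.nan, 7, 4, np.nan],
})

# Mean Comedy_Score of each department, aligned to the original rows
group_mean = df.groupby("department")["Comedy_Score"].transform("mean")

# Replace each missing score with the mean of its own department
df["Comedy_Score"] = df["Comedy_Score"].fillna(group_mean)
print(df)
```

The missing physics score is filled with 8.0 (the mean of 9 and 7), and the missing engineering score is filled with 4.0.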

    • Selecting the rows of df where age is not NaN and Rating_Score is not NaN

print(df[df["age"].notnull() & df["Rating_Score"].notnull()])
print(df[df["age"].notnull() & df["Rating_Score"].notnull()].fillna(0))

So the output comes as:

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
2    Leonard    Hofstadter   NaN           NaN          49.0       NaN
3     Howard      Wolowitz  41.0           8.0           NaN       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score
0    Sheldon        Copper  42.0           9.0          25.0
1        Raj  Koothrappali  38.0           7.0          25.0
2    Leonard    Hofstadter   NaN           NaN          49.0
3     Howard      Wolowitz  41.0           8.0           NaN
4        Amy        Fowler  35.0           5.0          70.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
2    Leonard    Hofstadter   0.0           0.0          49.0       0.0
3     Howard      Wolowitz  41.0           8.0           0.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0          9.00          25.0       NaN
1        Raj  Koothrappali  38.0          7.00          25.0       NaN
2    Leonard    Hofstadter   NaN          7.25          49.0       NaN
3     Howard      Wolowitz  41.0          8.00           NaN       NaN
4        Amy        Fowler  35.0          5.00          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       NaN
1        Raj  Koothrappali  38.0           7.0          25.0       NaN
4        Amy        Fowler  35.0           5.0          70.0       NaN

  first_name     last_name   age  Comedy_Score  Rating_Score  location
0    Sheldon        Copper  42.0           9.0          25.0       0.0
1        Raj  Koothrappali  38.0           7.0          25.0       0.0
4        Amy        Fowler  35.0           5.0          70.0       0.0
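The last selection step, which keeps rows where both age and Rating_Score are present, can also be written with `dropna` and its `subset` argument. This is a sketch of an equivalent form, assuming a shortened copy of the recipe's data; the names `mask_version` and `subset_version` are illustrative.

```python
import numpy as np
import pandas as pd

# Shortened copy of the recipe's data
df = pd.DataFrame({
    "first_name": ["Sheldon", "Raj", "Leonard", "Howard", "Amy"],
    "age": [42, 38, np.nan, 41, 35],
    "Rating_Score": [25, 25, 49, np.nan, 70],
})

# Boolean-mask form, as used in the recipe
mask_version = df[df["age"].notnull() & df["Rating_Score"].notnull()]

# dropna form: only look at the listed columns when deciding what to drop
subset_version = df.dropna(subset=["age", "Rating_Score"])

print(subset_version)
print(mask_version.equals(subset_version))
```

Both forms keep the rows for Sheldon, Raj and Amy.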

