
How to Handle Missing Data in Python? [Explained in 5 Easy Steps]

BEGINNER | PYTHON | STRUCTURED DATA

When we work in the data science industry, we need to know how to use libraries such as NumPy, Pandas, and Sklearn to build end-to-end machine learning models. One of the steps in the data science lifecycle is data cleaning: the process of finding and correcting inaccurate or incorrect data in the dataset. A natural part of this process is dealing with the missing values in the dataset. In real life, many datasets arrive with a large number of missing values, and this article will teach you how to handle missing data in Python.
Learning Objectives

In this article, we will learn all about finding and handling missing data.
We will also look at hands-on tutorials that teach beginners how to handle missing data using Python and Pandas.
Table of contents

Why Fill in the Missing Data?


How to Know If the Data Has Missing Values?
Different Methods of Dealing With Missing Data
1. Deleting the column with missing data
2. Deleting the row with missing data
3. Filling the Missing Values – Imputation
4. Other imputation methods
5. Filling with a Regression Model

Conclusion
Frequently Asked Questions

Why Fill in the Missing Data?

It is necessary to fill in missing data values in datasets because most machine learning models will raise an error if you pass NaN values to them. The easiest way to handle missing data in Python is to simply fill the gaps with 0, but it is essential to note that this approach can reduce your model's accuracy significantly.
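As a quick illustration (a minimal sketch; df is assumed to be a pandas DataFrame):

# Replace every NaN in the DataFrame with 0 -- simple, but it can distort
# feature distributions and hurt model accuracy, as noted above.
df_filled = df.fillna(0)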

There are many methods available for filling missing values. To choose the best one, you need to understand the type of missing value and its significance before you start filling or deleting data. This is the key to fully understanding how to handle missing data in Python.
Python Code:

Import the required libraries that you will be using, numpy and pandas, and then use the pandas read_csv function to read the dataset. See that the data contains many columns like PassengerId, Name, Age, etc. We won't be working with all the columns in the dataset, so I am going to delete the columns I don't need.
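A minimal reconstruction of that step (the file name 'train.csv' is an assumption; substitute the path to your copy of the Titanic dataset):

import numpy as np
import pandas as pd

# 'train.csv' is a placeholder for the Titanic dataset used in this article
df = pd.read_csv('train.csv')
print(df.head())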

df.drop("Name",axis=1,inplace=True) df.drop("Ticket",axis=1,inplace=True)

df.drop("PassengerId",axis=1,inplace=True) df.drop("Cabin",axis=1,inplace=True)
df.drop("Embarked",axis=1,inplace=True)

See that there are also categorical values in the dataset; to handle these, you need to use Label Encoding or One Hot Encoding.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# keep an independent copy for later sections; plain assignment (newdf = df)
# would only copy a reference, so dropping Survived below would affect it too
newdf = df.copy()

# splitting the data into x and y
y = df['Survived']
df.drop("Survived", axis=1, inplace=True)

How to Know If the Data Has Missing Values?

Missing values are usually represented as NaN, null, or None in the dataset.

The df.info() function can be used to get information about the dataset, including insights into missing data in Python. It is one of the most used functions for data analysis. It will give you the column names, the number of non-null values in each column, and the data type of each column. From this, we can find out which columns contain null values, and by looking at the data types, we can get an idea of which value to replace the nulls with when addressing missing data in Python.

Sometimes, though, instead of np.nan, null values could be present as empty strings or other placeholder values, so we must be careful and make sure that all the null values in our dataset are np.nan values.
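One way to normalize them (a sketch; the placeholder strings listed are assumptions, so inspect your data to see which ones actually occur):

import numpy as np

# Convert common null placeholders into real NaN values so that
# isnull() and fillna() can detect them.
df = df.replace(['', 'NA', 'null', 'None'], np.nan)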

df.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 6 columns): #
Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 891 non-null int64 1 Sex 891 non-null

int64 2 Age 714 non-null float64 3 SibSp 891 non-null int64 4 Parch 891 non-null int64 5 Fare 891 non-null

float64 dtypes: float64(2), int64(4) memory usage: 41.9 KB

See that there are null values in the column Age.

The second way of finding whether we have null values in the data is by using the isnull() function.

print(df.isnull().sum())

Pclass      0
Sex         0
Age       177
SibSp       0
Parch       0
Fare        0
dtype: int64


See that all the null values in the dataset are in the column – Age.

Let’s try fitting the data using logistic regression.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

See that the logistic regression model does not work, as we have NaN values in the dataset. Only a few machine learning algorithms can work with missing data, such as KNN, which can ignore observations with NaN values.

Different Methods of Dealing With Missing Data

Let’s now look at the different methods that you can use to deal with the missing data.

1. Deleting the column with missing data

In this case, let’s delete the column, Age and then fit the model and check for accuracy.

But this is an extreme case and should only be used when there are many null values in the column.

updated_df = df.dropna(axis=1)

updated_df.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 5 columns): #
Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 891 non-null int64 1 Sex 891 non-null

int64 2 SibSp 891 non-null int64 3 Parch 891 non-null int64 4 Fare 891 non-null float64 dtypes: float64(1),

int64(4) memory usage: 34.9 KB

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y, test_size=0.3)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.7947761194029851

See that we can achieve an accuracy of about 79.5%.

The problem with this method is that we may lose valuable information in that feature, as we have deleted it completely because of a few null values. It should only be used if the column has too many null values.

2. Deleting the row with missing data

If a certain row has missing data, you can delete the entire row, with all the features in that row.

axis=1 is used to drop the column with NaN values.

axis=0 is used to drop the row with NaN values.

updated_df = newdf.dropna(axis=0)

y1 = updated_df['Survived']
updated_df.drop("Survived", axis=1, inplace=True)

updated_df.info()

<class 'pandas.core.frame.DataFrame'> Int64Index: 714 entries, 0 to 890 Data columns (total 6 columns): #
Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 714 non-null int64 1 Sex 714 non-null
int64 2 Age 714 non-null float64 3 SibSp 714 non-null int64 4 Parch 714 non-null int64 5 Fare 714 non-null

float64 dtypes: float64(2), int64(4) memory usage: 39.0 KB

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y1, test_size=0.3)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.8232558139534883

In this case, see that we are able to achieve better accuracy than before. This may be because the column Age carries more valuable information than we expected.

3. Filling the Missing Values – Imputation


In this case, we will be filling the missing values with a certain number.

The possible ways to do this are (option 1 is demonstrated below; a short sketch of options 2-4 follows the list):

1. Filling the missing data with the mean or median value if it's a numerical variable.
2. Filling the missing data with the mode if it's a categorical variable.
3. Filling the numerical value with 0 or -999, or some other number that does not occur in the data, so that the model can recognize that the value is not real.
4. Filling the categorical value with a new category for the missing values.
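A minimal sketch of options 2-4. Sex and Age are columns from this dataset; Cabin was dropped earlier, so that line is purely illustrative, and the -999 sentinel and 'Missing' label are arbitrary choices. These are standalone alternatives, not steps to run in sequence:

# Option 2: fill a categorical column with its mode (most frequent value)
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])

# Option 3: fill a numerical column with a sentinel value that cannot
# occur naturally in the data
df['Age'] = df['Age'].fillna(-999)

# Option 4: give missing categorical values their own category
# (illustrative only -- Cabin was dropped from df earlier in this article)
df['Cabin'] = df['Cabin'].fillna('Missing')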

You can use the fillna() function to fill the null values in the dataset.

updated_df = df
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
updated_df.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 7 columns): #
Column Non-Null Count Dtype --- ------ -------------- ----- 0 Survived 891 non-null int64 1 Pclass 891 non-
null int64 2 Sex 891 non-null int64 3 Age 891 non-null float64 4 SibSp 891 non-null int64 5 Parch 891 non-

null int64 6 Fare 891 non-null float64 dtypes: float64(2), int64(5) memory usage: 48.9 KB

y1 = updated_df['Survived']
updated_df.drop("Survived", axis=1, inplace=True)

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y1, test_size=0.3)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.7798507462686567

The accuracy value comes out to about 78%, which is a reduction compared to the previous case. This will not happen in general; in this case, it suggests that the mean was not a good substitute for the missing values.

4. Other imputation methods

Just like the fillna function there is another function called interpolate, it uses linear interpolation which
means that it estimates unknown values between two known data points.

We can also use the bfill function which backfills the unknown values with the value in the next row.
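A short sketch of both options on the Age column (use one or the other; applying interpolate first would leave nothing for bfill to do):

# Option A: linear interpolation -- estimate each missing Age from the
# known values around it.
df['Age'] = df['Age'].interpolate(method='linear')

# Option B: backfill -- copy the next row's Age into the missing slot.
df['Age'] = df['Age'].bfill()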

Imputation with an additional column

Use the SimpleImputer() function from the sklearn module to impute the values. Pass the strategy as an argument to the function; it can be 'mean', 'median', or 'most_frequent' (mode).

The problem with the previous approach is that the model does not know whether a value came from the original data or was imputed. To make sure the model knows this, we add an Ageismissing column, which is True if the value was null and False otherwise.

updated_df = df
updated_df['Ageismissing'] = updated_df['Age'].isnull()

from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(strategy='median')
data_new = my_imputer.fit_transform(updated_df)  # returns a NumPy array of the imputed values

updated_df.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 7 columns): #
Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 891 non-null int64 1 Sex 891 non-null

int64 2 Age 891 non-null float64 3 SibSp 891 non-null int64 4 Parch 891 non-null int64 5 Fare 891 non-null
float64 6 Ageismissing 891 non-null bool dtypes: bool(1), float64(2), int64(4) memory usage: 42.8 KB

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(updated_df, y1, test_size=0.3)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
print(metrics.accuracy_score(pred, y_test))

0.7649253731343284

5. Filling with a Regression Model

In this case, the null values in one column are filled by fitting a regression model on the other columns in the dataset. That is, the regression model uses every column except Age as X and Age as y. After filling in the values in the Age column, we will use logistic regression to calculate the accuracy, as before.

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
df.head()

# rows with a missing Age become the prediction set; the rest are used for training
testdf = df[df['Age'].isnull() == True]
traindf = df[df['Age'].isnull() == False]

y = traindf['Age']
traindf.drop("Age", axis=1, inplace=True)
lr.fit(traindf, y)

testdf.drop("Age", axis=1, inplace=True)
pred = lr.predict(testdf)
testdf['Age'] = pred
traindf['Age'] = y

y = traindf['Survived']
traindf.drop("Survived", axis=1, inplace=True)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(traindf, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

y_test = testdf['Survived']
testdf.drop("Survived", axis=1, inplace=True)
pred = lr.predict(testdf)
print(metrics.accuracy_score(pred, y_test))

0.8361581920903954

See that this approach produces better accuracy than the previous models, as we are using a dedicated regression model to fill in the missing values.

We can also use models such as KNN for filling in the missing values. But sometimes, using models for imputation can result in overfitting the data.
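A minimal sketch of KNN-based imputation using scikit-learn's KNNImputer (n_neighbors=5 is the library default and an arbitrary choice here; df is the numeric DataFrame used throughout this article):

from sklearn.impute import KNNImputer
import pandas as pd

# Each missing Age is filled with the mean Age of the 5 rows that are
# closest on the remaining (non-missing) feature values.
imputer = KNNImputer(n_neighbors=5)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)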

Imputing missing values using a regression model allowed us to improve our model compared to dropping the column.

But you have to understand that there is no perfect way of filling the missing values in a dataset.

Conclusion

Each of the methods may work well with different types of datasets. You have to experiment with different
techniques to check which approach works best for handling missing data in Python within your dataset.
Understanding why data are missing is crucial for appropriately managing the remaining data. If values are
missing completely at random, the data sample is likely still representative of the population. However, if
the values are missing systematically, the analysis may be biased, emphasizing the importance of practical
techniques for addressing missing data in Python.

Key Takeaways

This article taught us about the different ways of handling missing values in our dataset.
If there are too many missing values in a column, you can drop that column; otherwise, you can impute the missing values with the mean, median, or mode.
Some pandas functions that can be used for handling missing values are fillna, dropna, bfill, and interpolate.

Frequently Asked Questions

Q1. Which is the best method to fill missing data in Python?

A. There is no “best“ way to fill missing values in pandas per say, however, the function fillna() is the most
widely used function to fill nan values in a dataframe. From this function, you can simply fill the values
according to your column with mean, median and mode.

Q2. What is the general idea of handling missing values in Python?

A. Missing values can bias the results of your machine learning models and reduce their accuracy. That is why we must handle these values correctly, so that the data is imputed properly.

Q3. How to use the pandas library to handle missing values in a dataset?

A. Pandas has many different functions that you can use to handle missing values, including fillna, bfill, and interpolate.

Article Url - https://www.analyticsvidhya.com/blog/2021/05/dealing-with-missing-values-in-python-a-complete-guide/
