
Handling Missing Values

The Data Cleaning course on Kaggle Learn focuses on addressing common data cleaning challenges in data science, such as handling missing values and inconsistent data formats. Participants will engage in hands-on exercises using real datasets, including one related to American Football games, to develop practical skills in data cleaning techniques. The course emphasizes understanding the reasons behind missing data and offers strategies for managing it effectively.

Welcome to the Data Cleaning course on Kaggle Learn!

Data cleaning is a key part of data science, but it can be deeply frustrating. Why are some of your text fields garbled? What should you do about those missing values? Why aren’t your dates formatted correctly? How can you
quickly clean up inconsistent data entry? In this course, you'll learn why you've run into these problems and, more importantly, how to fix them!

In this course, you'll learn how to tackle some of the most common data cleaning problems so you can get to actually analyzing your data faster. You'll work through five hands-on exercises with real, messy data and answer some of your most commonly asked data cleaning questions.

In this notebook, we'll look at how to deal with missing values.

Take a first look at the data


The first thing we'll need to do is load in the libraries and dataset we'll be using.

For demonstration, we'll use a dataset of events that occurred in American Football games. In the following exercise (https://www.kaggle.com/kernels/fork/10824396), you'll apply your new skills to a dataset of building permits issued in San Francisco.

In [1]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")

# set seed for reproducibility
np.random.seed(0)

/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3553: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)

The first thing to do when you get a new dataset is take a look at some of it. This lets you see that it all read in correctly and gives an idea of what's going on with the data. In this case, let's see if there are any missing values, which will be represented with NaN or None .

In [2]:
# look at the first five rows of the nfl_data file.
# I can see a handful of missing data already!
nfl_data.head()

Out[2]:

Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 NaN 15:00 15 3600.0 0.0 TEN ... NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593.0 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556.0 37.0 PIT ... NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515.0 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507.0 8.0 PIT ... NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009

5 rows × 102 columns

Yep, it looks like there are some missing values.

How many missing data points do we have?

Ok, now we know that we do have some missing values. Let's see how many we have in each column.

In [3]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Out[3]:
Date 0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64
That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:

In [4]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

24.87214126835169

Wow, almost a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
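
Before we do, it can help to see where those gaps are concentrated. Here's a small sketch (not part of the original notebook) that builds on the missing_values_count Series from above:

# a quick sketch: turn the per-column counts into percentages and
# list the columns with the most missing values first
percent_missing_by_column = (missing_values_count / len(nfl_data)) * 100
percent_missing_by_column.sort_values(ascending=False).head(10)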

Figure out why the data is missing

This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean "really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. One of the most important questions you can ask yourself to help figure this out is this:

Is this value missing because it wasn't recorded or because it doesn't exist?

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as NaN . On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row. This is called imputation, and we'll learn how to do it next! :)

Let's work through an example. Looking at the number of missing values in the nfl_data dataframe, I notice that the column "TimeSecs" has a lot of missing values in it:

In [5]:
# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Out[5]:
Date 0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64

By looking at the documentation (https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016), I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.
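
Since plays within a game happen in a fixed order, one plausible way to make that guess is to interpolate each game's TimeSecs values from the surrounding plays. This is a sketch, not part of the original notebook, and it assumes the rows are stored in chronological order within each game:

# a sketch: estimate each missing TimeSecs value from the plays around it,
# interpolating separately within each game
nfl_data.groupby('GameID')['TimeSecs'].transform(lambda game: game.interpolate())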

On the other hand, there are other fields, like "PenalizedTeam", that also have a lot of missing values. In this case, though, the field is missing because if there was no penalty then it doesn't make sense to say which team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.
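
That second option is a one-liner with fillna(). A sketch (not in the original notebook):

# a sketch: replace NA's in PenalizedTeam with an explicit "neither" category
nfl_data['PenalizedTeam'].fillna("neither")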

Tip: This is a great place to read over the dataset documentation if you haven't already! If you're working with a dataset that you've gotten from another person, you can also try reaching out to them to get more
information.

If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we'll cover some "quick and
dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.

Drop missing values


If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values. (Note: I don't generally recommend this approach for important projects! It's usually worth it to take the time to go through your data and really look at all the columns with missing values one-by-one to really get to know your dataset.)

If you're sure you want to drop rows with missing values, pandas does have a handy function, dropna(), to help you do this. Let's try it out on our NFL dataset!

In [6]:
# remove all the rows that contain a missing value
nfl_data.dropna()

Out[6]:

Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season

0 rows × 102 columns

Oh dear, it looks like that's removed all our data! 😱 This is because every row in our dataset had at least one missing value. We might have better luck removing all the columns that have at least one missing value instead.
In [7]:
# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()

Out[7]:

Date GameID Drive qtr TimeUnder ydstogo ydsnet PlayAttempted Yards.Gained sp ... Timeout_Indicator Timeout_Team posteam_timeouts_pre HomeTimeouts_Remaining_Pre AwayTimeouts_Remaining_Pre
0 2009-09-10 2009091000 1 1 15 0 0 1 39 0 ... 0 None 3 3 3
1 2009-09-10 2009091000 1 1 15 10 5 1 5 0 ... 0 None 3 3 3
2 2009-09-10 2009091000 1 1 15 5 2 1 -3 0 ... 0 None 3 3 3
3 2009-09-10 2009091000 1 1 14 8 2 1 0 0 ... 0 None 3 3 3
4 2009-09-10 2009091000 1 1 14 8 2 1 0 0 ... 0 None 3 3 3

5 rows × 41 columns

In [8]:
# just how much data did we lose?
print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102

Columns with na's dropped: 41

We've lost quite a bit of data, but at this point we have successfully removed all the NaN's from our data.
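
If dropping every column with even one missing value feels too aggressive, dropna() also takes a thresh argument that keeps only rows or columns with at least that many non-NA values. A middle-ground sketch (not part of the original notebook):

# a sketch: keep only the columns that are at least 90% complete
min_non_null = int(0.9 * len(nfl_data))
nfl_data.dropna(axis=1, thresh=min_non_null).shape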

Filling in missing values automatically

Another option is to try and fill in the missing values. For this next bit, I'm getting a small sub-section of the NFL data so that it will print well.

In [9]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

Out[9]:

EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2.014474 NaN NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 NaN NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 NaN NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009

We can use the pandas fillna() function to fill in missing values in a dataframe for us. One option we have is to specify what we want the NaN values to be replaced with. Here, I'm saying that I would like to replace all the NaN values with 0.

In [10]:
# replace all NA's with 0
subset_nfl_data.fillna(0)

Out[10]:

EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2.014474 0.000000 0.000000 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 0.000000 0.000000 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 0.000000 0.000000 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.000000 0.000000 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 0.000000 0.000000 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 0.000000 0.000000 2009

I could also be a bit more savvy and replace missing values with whatever value comes directly after them in the same column. (This makes a lot of sense for datasets where the observations have some sort of logical order to them.)
In [11]:
# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining NA's with 0
subset_nfl_data.bfill(axis=0).fillna(0)

Out[11]:

EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2.014474 -1.068169 1.146076 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 -0.032244 0.036899 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 3.318841 -5.031425 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.106663 -0.156239 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 0.000000 0.000000 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 0.000000 0.000000 2009

Your turn
Write your own code to deal with missing values (https://www.kaggle.com/kernels/fork/10824396) in a dataset of building permits issued in San Francisco.

Have questions or comments? Visit the course discussion forum (https://www.kaggle.com/learn/data-cleaning/discussion) to chat with other learners.
