Boost your Data Analysis with Pandas


Everything you need to know to start increasing your productivity, with code examples

Whether you are building complex Machine Learning models or you just want to organize your monthly
budget in an Excel spreadsheet, you must know how to manipulate and analyze your data.

While many tools can get the job done, today we're going to talk about one of the most widely used and beginner-friendly of them all: Pandas (https://pandas.pydata.org/).


Why use Pandas?


Pandas is an open-source Python library designed for data analysis and data manipulation. Citing
the official website (https://pandas.pydata.org/),

"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language."

It is built on top of NumPy (https://numpy.org/) (a Python library for scientific computing) and it has several
functions for cleaning, analyzing, and manipulating data, which can help you extract valuable insights about
your data set. Pandas is great for working with tabular data, as in SQL tables or Excel spreadsheets.

The main data structure in Pandas is a 2-dimensional table called DataFrame. To create a DataFrame, you
can import data in several formats, such as CSV, XLSX, JSON, and SQL, to name a few. With a few lines of
code, you can add, delete, or edit data in your rows/columns, check your data set's statistics, identify and handle
missing entries, etc.

Moreover, as stated above, Pandas is widely used and beginner-friendly, which means you'll find a lot of
content about it online, and it shouldn't be hard to find answers to your questions.

Getting started
First of all, we need to install Pandas, and there are several different environments where you can run it. If
you want to run it directly on your machine, you should take a look at Anaconda (https://www.anaconda.com/),
a distribution aimed at scientific computing that comes with hundreds of pre-installed packages. Anaconda
can be installed on Windows, macOS, and Linux.

However, there is an easier way to get started with Pandas through your browser, using a Jupyter Notebook
(https://jupyter.org/) in the cloud. For instance, you could use IBM Watson Studio
(https://www.ibm.com/cloud/watson-studio) or Google Colab (https://colab.research.google.com/). Both can
be used for free and come with several Python packages pre-installed.

In this article, I am using Google Colab because it is really easy to use out of the box and doesn't require
any prior setup.

Installing and importing Pandas


To install Pandas, run one of the following commands in your environment, depending on your package
manager.


In [ ]:

pip install pandas

Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

or

In [ ]:

conda install pandas

Note that in Google Colab we don't need to run the commands above, since Pandas comes pre-installed.

Now, we need to import Pandas, so we can use it in our Jupyter Notebook.

In [47]:

import pandas as pd

We commonly import it as pd, a shortcut that saves us from writing the full name every time we need to
call a Pandas function.

Pandas DataFrame
After installing and importing Pandas, let's see how we can read a file and create a Pandas DataFrame. The
data set we'll be working with in this article is a simplified version of a set provided in a Kaggle competition
(https://www.kaggle.com/c/home-data-for-ml-course/overview) about housing prices.

The file containing this data set is a .csv (comma-separated values) file. If you want to play with it yourself,
you can find it here
(https://raw.githubusercontent.com/rmpbastos/data_sets/main/kaggle_housing/house_df.csv), in my GitHub
repository.

Reading files and creating a DataFrame


To read the file into a DataFrame, we just need to pass the file path as an argument to the function below:

In [48]:

PATH = 'https://raw.githubusercontent.com/rmpbastos/data_sets/main/kaggle_housing/house_df.csv'
df = pd.read_csv(PATH)


Notice that I used the function read_csv because we are working with a csv file. As mentioned above,
Pandas can handle several file formats, as you can check here
(https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

The function above read the csv file and automatically created a DataFrame from it. But if you want to create
a DataFrame from a Python dict, list, NumPy array, or even from another DataFrame, you may use the
function below.

In [ ]:

mydict = {'col_a': [1, 2], 'col_b': [3, 4]}  # any dict of equal-length lists will do
df = pd.DataFrame(mydict)

Let's check the type of the DataFrame we just created.

In [ ]:

type(df)

Out[ ]:

pandas.core.frame.DataFrame

Examining the DataFrame


For the rest of this article, we'll be working with the housing data set mentioned above. The next thing we
should do is take a look at our DataFrame. We can check the first n entries with the function head. If n is
not provided, we'll see the first 5 rows by default.

In [3]:

df.head()

Out[3]:

Id LotArea Street Neighborhood HouseStyle YearBuilt CentralAir BedroomAbvGr ...

0 1 8450 Pave CollgCr 2Story 2003 Y 3

1 2 9600 Pave Veenker 1Story 1976 Y 3

2 3 11250 Pave CollgCr 2Story 2001 Y 3

3 4 9550 Pave Crawfor 2Story 1915 Y 3

4 5 14260 Pave NoRidge 2Story 2000 Y 4

At first glance, everything looks fine. We can also check the last entries of the set with tail.


In [4]:

df.tail()

Out[4]:

Id LotArea Street Neighborhood HouseStyle YearBuilt CentralAir BedroomAbvGr

1455 1456 7917 Pave Gilbert 2Story 1999 Y 3

1456 1457 13175 Pave NWAmes 1Story 1978 Y 3

1457 1458 9042 Pave Crawfor 2Story 1941 Y 4

1458 1459 9717 Pave NAmes 1Story 1950 Y 2

1459 1460 9937 Pave Edwards 1Story 1965 Y 3

Next, let's check the dimensions of our data using the shape attribute.

In [5]:

df.shape

Out[5]:

(1460, 16)

It returns a tuple with the number of rows and columns. Our DataFrame has 1,460 rows and 16 columns, or
features.

Moving on, we can view a summary of the data set with the function info.


In [6]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 LotArea 1460 non-null int64
2 Street 1460 non-null object
3 Neighborhood 1460 non-null object
4 HouseStyle 1460 non-null object
5 YearBuilt 1460 non-null int64
6 CentralAir 1460 non-null object
7 BedroomAbvGr 1460 non-null int64
8 Fireplaces 1460 non-null int64
9 GarageType 1379 non-null object
10 GarageYrBlt 1379 non-null float64
11 GarageArea 1460 non-null int64
12 PoolArea 1460 non-null int64
13 PoolQC 7 non-null object
14 Fence 281 non-null object
15 SalePrice 1460 non-null int64
dtypes: float64(1), int64(8), object(7)
memory usage: 182.6+ KB

It shows us useful information about the DataFrame, such as column names, non-null counts, dtypes, and
memory usage. From this summary, we can see that some columns have missing values, a topic we'll
return to later.

The following function will give us some descriptive statistics about the dataset.

In [7]:

df.describe()

Out[7]:

Id LotArea YearBuilt BedroomAbvGr Fireplaces GarageYrBlt G

count 1460.000000 1460.000000 1460.000000 1460.000000 1460.000000 1379.000000 1

mean 730.500000 10516.828082 1971.267808 2.866438 0.613014 1978.506164

std 421.610009 9981.264932 30.202904 0.815778 0.644666 24.689725

min 1.000000 1300.000000 1872.000000 0.000000 0.000000 1900.000000

25% 365.750000 7553.500000 1954.000000 2.000000 0.000000 1961.000000

50% 730.500000 9478.500000 1973.000000 3.000000 1.000000 1980.000000

75% 1095.250000 11601.500000 2000.000000 3.000000 1.000000 2002.000000

max 1460.000000 215245.000000 2010.000000 8.000000 3.000000 2010.000000 1


This function displays the count, mean, standard deviation, median (the 50% row), and the upper and lower
quartiles, as well as the minimum and maximum values for each feature. Notice that, by default, it only shows
statistics for the numeric features (columns whose data type is int or float).
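
As a side note, if you also want a summary of the non-numeric columns, describe accepts an include argument; for instance:

In [ ]:

df.describe(include='object')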

Next, let me show you one more function, value_counts, before moving on to the next section.

In [8]:

df['Neighborhood'].value_counts()

Out[8]:

NAmes 225
CollgCr 150
OldTown 113
Edwards 100
Somerst 86
Gilbert 79
NridgHt 77
Sawyer 74
NWAmes 73
SawyerW 59
BrkSide 58
Crawfor 51
Mitchel 49
NoRidge 41
Timber 38
IDOTRR 37
ClearCr 28
SWISU 25
StoneBr 25
Blmngtn 17
MeadowV 17
BrDale 16
Veenker 11
NPkVill 9
Blueste 2
Name: Neighborhood, dtype: int64

This function returns a Series containing the count of occurrences of each unique value. It can be applied to the
whole DataFrame, but in the example above, we applied it only to the column "Neighborhood".
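
If you prefer proportions over raw counts, value_counts accepts a normalize argument:

In [ ]:

df['Neighborhood'].value_counts(normalize=True)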


Before we move on, let me summarize each feature of the dataset, for a better understanding.

Id - Unique identification for each row (we'll use it as our index).


LotArea - Lot size in square feet
Street - Type of road access
Neighborhood - Physical location of the house
HouseStyle - Style of residence
YearBuilt - Construction date
CentralAir - Central air conditioning
BedroomAbvGr - Number of bedrooms above basement level
Fireplaces - Number of fireplaces
GarageType - Garage location
GarageYrBlt - Year garage was built
GarageArea - Size of garage in square feet
PoolArea - Pool area in square feet
PoolQC - Pool quality
Fence - Fence quality
SalePrice - House price

Our dataset contains data of different types, such as numerical, categorical, and boolean, but we won't dive into
these concepts, as they are out of scope here.

Now, let's start manipulating our DataFrame.

Manipulating data with Pandas


Pandas provides us with several tools for handling data. In this section, we'll see how to manipulate rows and
columns as well as locate and edit values in our table. Let's start by setting an index for our DataFrame.

DataFrame Index
After checking our data, we noticed that the first column (Id) has a unique value for each row. We can take
advantage of it and use this column as our index, in place of the index created by default when we set up the
DataFrame.

In [49]:

df.set_index('Id', inplace=True)

In [10]:

df.index

Out[10]:

Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
...
1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460],
dtype='int64', name='Id', length=1460)


When the argument inplace is set to True, the DataFrame is updated in place. Otherwise, with inplace=False,
which is the default value, the method returns a modified copy of the DataFrame.

If you know beforehand that you are going to use a column of your data set as the index, you can set it
when reading the file, as below.

In [ ]:

df = pd.read_csv(PATH, index_col='Id')

Let's check out how the set looks after setting the index.

In [11]:

df.head()

Out[11]:

LotArea Street Neighborhood HouseStyle YearBuilt CentralAir BedroomAbvGr ...

Id

1 8450 Pave CollgCr 2Story 2003 Y 3

2 9600 Pave Veenker 1Story 1976 Y 3

3 11250 Pave CollgCr 2Story 2001 Y 3

4 9550 Pave Crawfor 2Story 1915 Y 3

5 14260 Pave NoRidge 2Story 2000 Y 4

The data set looks a bit cleaner now! Moving on, let's talk about rows and columns.

Rows and Columns


As you have noticed, data frames are tabular data structures, containing rows and columns. In Pandas, a single
column is called a Series. We can easily list the columns and access them with the code below.

In [ ]:

df.columns

Out[ ]:

Index(['LotArea', 'Street', 'Neighborhood', 'HouseStyle', 'YearBuilt',
       'CentralAir', 'BedroomAbvGr', 'Fireplaces', 'GarageType', 'GarageYrBlt',
       'GarageArea', 'PoolArea', 'PoolQC', 'Fence', 'SalePrice'],
      dtype='object')


In [ ]:

df['LotArea'].head()

Out[ ]:

Id
1 8450
2 9600
3 11250
4 9550
5 14260
Name: LotArea, dtype: int64

In [ ]:

type(df['LotArea'])

Out[ ]:

pandas.core.series.Series

Notice that a column in our DataFrame is of type Series.

Renaming columns

It is really simple to rename your columns with Pandas. For instance, let's take our feature BedroomAbvGr
and rename it to Bedroom.

In [50]:

df.rename(columns={'BedroomAbvGr': 'Bedroom'}, inplace=True)

You can rename several columns at once; just add all the old/new names as key/value pairs in
the columns dictionary inside the function rename.
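
For instance, a quick sketch with hypothetical new names, assigned to a new variable so the original df stays untouched:

In [ ]:

# Hypothetical renames, for illustration only
df_renamed = df.rename(columns={'LotArea': 'LotSize', 'YearBuilt': 'YearConstructed'})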

Adding columns

You might want to add a column to your DataFrame. Let's see how you can do that.

I'll take the opportunity to show you how we can create a copy of our DataFrame. Let's add a column to our
copy so we don't change the original DataFrame.

In [ ]:

df_copy = df.copy()

In [ ]:

df_copy['Sold'] = 'N'

This is the easiest way to create a new column. Notice that I assigned the value 'N' to all entries in this column.
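
You can also build a new column from existing ones. A hypothetical sketch:

In [ ]:

# Hypothetical derived column: house age at the time of writing
df_copy['HouseAge'] = 2023 - df_copy['YearBuilt']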


In [ ]:

df_copy.tail()

Out[ ]:

LotArea Street Neighborhood HouseStyle YearBuilt CentralAir Bedroom Fireplaces

Id

1456 7917 Pave Gilbert 2Story 1999 Y 3 1

1457 13175 Pave NWAmes 1Story 1978 Y 3 2

1458 9042 Pave Crawfor 2Story 1941 Y 4 2

1459 9717 Pave NAmes 1Story 1950 Y 2 0

1460 9937 Pave Edwards 1Story 1965 Y 3 0

Adding rows

Now, let's suppose you have another DataFrame (called df_to_append) containing 2 rows that you want to
add to df_copy. One way to append rows to the end of a DataFrame is with the function append (see the note about pd.concat after the example).

In [ ]:

data_to_append = {'LotArea': [9500, 15000],
                  'Street': ['Pave', 'Gravel'],
                  'Neighborhood': ['Downtown', 'Downtown'],
                  'HouseStyle': ['2Story', '1Story'],
                  'YearBuilt': [2021, 2019],
                  'CentralAir': ['Y', 'N'],
                  'Bedroom': [5, 4],
                  'Fireplaces': [1, 0],
                  'GarageType': ['Attchd', 'Attchd'],
                  'GarageYrBlt': [2021, 2019],
                  'GarageArea': [300, 250],
                  'PoolArea': [0, 0],
                  'PoolQC': ['G', 'G'],
                  'Fence': ['G', 'G'],
                  'SalePrice': [250000, 195000],
                  'Sold': ['Y', 'Y']}

df_to_append = pd.DataFrame(data_to_append)

In [ ]:

df_to_append

Out[ ]:

LotArea Street Neighborhood HouseStyle YearBuilt CentralAir Bedroom Fireplaces ...

0 9500 Pave Downtown 2Story 2021 Y 5 1

1 15000 Gravel Downtown 1Story 2019 N 4 0


Now, let's append the 2-row DataFrame above to df_copy.

In [ ]:

df_copy = df_copy.append(df_to_append, ignore_index=True)
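
Note that DataFrame.append was deprecated in Pandas 1.4 and removed in Pandas 2.0. In recent versions, pd.concat does the same job; a minimal equivalent sketch:

In [ ]:

# Equivalent to the append call above, for Pandas 2.0 and later
df_copy = pd.concat([df_copy, df_to_append], ignore_index=True)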

Checking the last entries of df_copy we have:

In [ ]:

df_copy.tail(3)

Out[ ]:

LotArea Street Neighborhood HouseStyle YearBuilt CentralAir Bedroom Fireplaces

1459 9937 Pave Edwards 1Story 1965 Y 3 0

1460 9500 Pave Downtown 2Story 2021 Y 5 1

1461 15000 Gravel Downtown 1Story 2019 N 4 0

Removing rows and columns

To eliminate rows and columns of a DataFrame, we can use the function drop. Let's assume we want to
delete the last row and the column 'Fence'. Check out the code below.

In [ ]:

df_copy.drop(labels=1461, axis=0, inplace=True)

The function above removed the last row (the one with Id 1461). You can also drop several rows at once by
passing a list of index labels as an argument.
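
For instance, a quick sketch with two arbitrary labels (without inplace=True, so df_copy is left unchanged):

In [ ]:

# Drop two rows at once by their index labels
df_copy.drop(labels=[1459, 1460], axis=0)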

The axis=0, which is the default value, means that you are removing a row. For columns, we need to specify
axis=1, as below.

In [ ]:

df_copy.drop(labels='Fence', axis=1, inplace=True)

Selecting data with loc and iloc


One of the easiest ways to select data is with the methods loc and iloc.

loc is used to access rows and columns by label/index or based on a boolean array. Imagine we want to
access the row with index = 1000.


In [ ]:

df.loc[1000]

Out[ ]:

LotArea 6762
Street Pave
Neighborhood CollgCr
HouseStyle 1Story
YearBuilt 2006
CentralAir Y
Bedroom 2
Fireplaces 0
GarageType Attchd
GarageYrBlt 2006
GarageArea 632
PoolArea 0
PoolQC NaN
Fence NaN
SalePrice 206000
Name: 1000, dtype: object

The method above selected the row with index = 1000 and displayed all data contained in this row. We can
also select which columns we want to visualize.

In [ ]:

df.loc[1000, ['LotArea', 'SalePrice']]

Out[ ]:

LotArea 6762
SalePrice 206000
Name: 1000, dtype: object

Now, let's see how we can apply a condition to loc. Imagine we want to select all houses that have a sale
price of at least $600,000.

In [ ]:

df.loc[df['SalePrice'] >= 600000]

Out[ ]:

LotArea Street Neighborhood HouseStyle YearBuilt CentralAir Bedroom Fireplaces

Id

692 21535 Pave NoRidge 2Story 1994 Y 4 2

899 12919 Pave NridgHt 1Story 2009 Y 2 2

1170 35760 Pave NoRidge 2Story 1995 Y 4 1

1183 15623 Pave NoRidge 2Story 1996 Y 4 2

With a single line of code, we found only 4 houses worth $600,000 or more.
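
You can also combine conditions with & and |, wrapping each one in parentheses. A quick sketch:

In [ ]:

# Houses worth at least $600,000 that also have central air
df.loc[(df['SalePrice'] >= 600000) & (df['CentralAir'] == 'Y')]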


iloc is used to select data based on their integer location or based on a boolean array as well. For instance, if
we want to select the data contained in the first row and the first column, we have the following:

In [ ]:

df.iloc[0,0]

Out[ ]:

8450

The value displayed is the LotArea of the row where ID is 1. Remember that the integer location is zero-
based.

We can also select an entire row; in this case, the row in position 10:

In [ ]:

df.iloc[10,:]

Out[ ]:

LotArea 11200
Street Pave
Neighborhood Sawyer
HouseStyle 1Story
YearBuilt 1965
CentralAir Y
Bedroom 3
Fireplaces 0
GarageType Detchd
GarageYrBlt 1965
GarageArea 384
PoolArea 0
PoolQC NaN
Fence NaN
SalePrice 129500
Name: 11, dtype: object

We can select an entire column, for instance, the last column.


In [ ]:

df.iloc[:,-1]

Out[ ]:

Id
1 208500
2 181500
3 223500
4 140000
5 250000
...
1456 175000
1457 210000
1458 266500
1459 142125
1460 147500
Name: SalePrice, Length: 1460, dtype: int64

And we can also select multiple rows and columns at once, as below.

In [ ]:

df.iloc[8:12, 2:5]

Out[ ]:

Neighborhood HouseStyle YearBuilt

Id

9 OldTown 1.5Fin 1931

10 BrkSide 1.5Unf 1939

11 Sawyer 1Story 1965

12 NridgHt 2Story 2005

Handling missing values


A lot can be said about dealing with missing values. Please bear in mind that the goal here is not to go
deep into the subject, but to show you the tools Pandas provides to handle missing values.

Detecting missing values


The first step is to find the values that are missing in your data set with the function isnull. This function
returns an object of the same size as the original DataFrame, containing a boolean value for each
element in the set. Values such as None and numpy.NaN are marked as True. You can find them with the
following line of code.


In [39]:

df.isnull()

Out[39]:

LotArea Street Neighborhood HouseStyle YearBuilt CentralAir Bedroom Fireplaces

Id

1 False False False False False False False False

2 False False False False False False False False

3 False False False False False False False False

4 False False False False False False False False

5 False False False False False False False False

... ... ... ... ... ... ... ... ...

1456 False False False False False False False False

1457 False False False False False False False False

1458 False False False False False False False False

1459 False False False False False False False False

1460 False False False False False False False False

1460 rows × 15 columns

Notice that it can be cumbersome to work with the data returned above. If you are working with a really small
data set, you should be fine, but with thousands of rows and several columns, as in our case, it is better to
sum the number of missing values per column, as follows.

In [68]:

df.isnull().sum()

Out[68]:

LotArea 0
Street 0
Neighborhood 0
HouseStyle 0
YearBuilt 0
CentralAir 0
Bedroom 0
Fireplaces 0
GarageType 81
GarageYrBlt 81
GarageArea 0
PoolArea 0
PoolQC 1453
Fence 1179
SalePrice 0
dtype: int64


Much better! Now we can easily see the number of missing values for each column, and we can tell that
most columns are complete, which is great. The sum function counted the True values returned
by isnull because True is equivalent to 1 and False to 0.

We can also check the proportion of missing values for each column:

In [41]:

df.isnull().sum() / df.shape[0]

Out[41]:

LotArea 0.000000
Street 0.000000
Neighborhood 0.000000
HouseStyle 0.000000
YearBuilt 0.000000
CentralAir 0.000000
Bedroom 0.000000
Fireplaces 0.000000
GarageType 0.055479
GarageYrBlt 0.055479
GarageArea 0.000000
PoolArea 0.000000
PoolQC 0.995205
Fence 0.807534
SalePrice 0.000000
dtype: float64

Let's take a step further and use Python to list only the columns with missing values, displaying the
percentage of values that are missing.

In [42]:

for column in df.columns:
    if df[column].isnull().sum() > 0:
        print(column, ': {:.2%}'.format(df[column].isnull().sum() / df[column].shape[0]))

GarageType : 5.55%
GarageYrBlt : 5.55%
PoolQC : 99.52%
Fence : 80.75%
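
As a side note, a more compact way to get the same list (a sketch using isnull().mean(), which gives the proportion of True values per column):

In [ ]:

# Proportion of missing values per column, keeping only columns that have any
missing = df.isnull().mean()
missing[missing > 0].map('{:.2%}'.format)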

Removing missing values


After detecting the missing values, we need to decide what to do with them. Here, I'll show you how to
eliminate them.

We should be very cautious before removing a whole column or row, because we are taking data out of our
data set, and that can harm the analysis.

First, let's think about the feature PoolQC. Since over 99% of its values are missing, we are
going to remove the whole column. As we saw in an earlier section, we can drop a column with the function drop.

Here, I'll use a copy of the original DataFrame.


In [51]:

df_toremove = df.copy()

In [53]:

df_toremove.drop(labels='PoolQC', axis=1, inplace=True)

Now, let's take a look at GarageType. Since only about 5% of its values are missing, we can simply remove
the rows where this feature is missing, using the function dropna.

In [55]:

df_toremove.dropna(subset=['GarageType'], axis=0, inplace=True)

Filling missing values


When dealing with missing values, besides removing them, we can fill the gaps with some
non-null value. There are several techniques to help you determine which values to insert in your
data, including using machine learning, and I really recommend searching for articles on this topic, but
here I'm only showing the tools Pandas provides to do the job.

First, let's take a look at the feature Fence. Notice that about 80% of its values are missing. Suppose that happened
because these houses simply don't have fences! So, we'll fill the missing data with the string NoFence.

Once again, I'll use a copy of the original DataFrame.

In [57]:

df_tofill = df.copy()

In [59]:

df_tofill['Fence'].fillna(value='NoFence', inplace=True)

Now, let's check the feature GarageYrBlt. In this example, I'll show you how to fill the missing entries with the median
value of the column.

In [66]:

garage_median = df_tofill['GarageYrBlt'].median()
df_tofill.fillna({'GarageYrBlt': garage_median}, inplace=True)
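
Similarly, for a categorical column such as GarageType, you could fill the gaps with a placeholder string. A sketch, assuming (as discussed below) that the missing values mean there is no garage:

In [ ]:

# Hypothetical placeholder for houses without a garage
df_tofill.fillna({'GarageType': 'NoGarage'}, inplace=True)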

Let me remind you that these examples are just for educational purposes. You might have noticed that
GarageType and GarageYrBlt each had 81 missing values, probably because those houses don't have any
garages. Removing those rows for missing GarageType and filling GarageYrBlt with some value might not be
the smartest thing to do in a real-life analysis. In fact, if I weren't using copies of the original DataFrame, you
would see that the rows we dropped for missing GarageType are the same 81 rows
where GarageYrBlt was missing as well. This shows the importance of interpreting and knowing your data.


Visualizing data
In this section, I'll talk about how to do some simple plotting with Pandas. If you want to build more elaborate
charts, I recommend you take a look at two other Python libraries: Matplotlib (https://matplotlib.org/) and
Seaborn (https://seaborn.pydata.org/).

Here, we'll see two types of charts: histogram and scatter plot.

Histograms
Histograms are great for displaying the distribution of data. Below is a histogram of the feature SalePrice, where
the x-axis contains bins dividing the values into intervals, and the y-axis shows the frequency.

In [83]:

df['SalePrice'].plot(kind='hist');
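
By default, the values are split into 10 bins. If you want finer intervals, you can pass the bins argument; a quick sketch:

In [ ]:

df['SalePrice'].plot(kind='hist', bins=30);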

Scatter plots
With a scatter plot, you can visualize the relationship between two variables. The chart is built using cartesian
coordinates (https://en.wikipedia.org/wiki/Cartesian_coordinate_system) to display the values as a collection
of points, with the value of one variable determining the position on the x-axis and the value of the
other determining the position on the y-axis.

Let's build a scatter plot for the variables SalePrice and YearBuilt to check if there is any relationship
between them.


In [87]:

df.plot(x='SalePrice', y='YearBuilt', kind='scatter')

Out[87]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f72eca45a10>

Well, we can see that there is a small positive relationship between the sale price and the year of
construction.

The plot function also supports many other kinds of charts, such as line, bar, area, pie, etc.
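
For instance, a quick sketch of a bar chart showing the ten most common neighborhoods:

In [ ]:

df['Neighborhood'].value_counts().head(10).plot(kind='bar');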

Saving to file
In the last section of this article, let's see how to save our DataFrame as a file.

Just like when we read the csv file to create our DataFrame, we can also save our DataFrame in various
formats (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). If you are using Google Colab,
you can simply write the following line of code to save a csv file.

In [88]:

df.to_csv('My_DataFrame.csv')

You can also specify a path on your machine. On Windows, for instance:

In [93]:

df.to_csv('C:/Users/username/Documents/My_DataFrame.csv')
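
By default, to_csv also writes the index as the first column. Here our Id index is meaningful, so we keep it, but if you ever want to drop it, pass index=False:

In [ ]:

df.to_csv('My_DataFrame.csv', index=False)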


Conclusion
I hope this article helped you get a grasp of what you can do with Pandas. After some practice, it gets quite
effortless to manipulate your data.

Pandas is widely used in data science. Many data scientists make use of it to manipulate data before building
machine learning models, but you can benefit from it even in simpler tasks you would do in Excel.

Nowadays, many of the activities I used to do in Excel at work, I now do with Python and Pandas. Maybe the
learning curve is a little steeper, but the potential increase in your productivity pays off, not to mention that
Python is an excellent tool to have under your belt!
