
Pandas 3

Uploaded by

cr.lucianoperez
Copyright
© © All Rights Reserved

Manipulating Pandas Dataframes

Once you have loaded data into your Pandas dataframe, you might need to
further manipulate the data and perform a variety of functions such as filtering
certain columns, dropping the others, selecting a subset of rows or columns,
sorting the data, finding unique values, and so on. You are going to study all
these functions in this chapter.

You will first see how to select data via indexing and slicing, followed by a
section on how to drop unwanted rows or columns from your data. You will
then study how to filter your desired rows and columns. The chapter
concludes with an explanation of sorting and finding unique values from a
Pandas dataframe.

3.1. Selecting Data Using Indexing and Slicing


Indexing refers to fetching data using the index or column information of a
Pandas dataframe. Slicing, on the other hand, refers to selecting a subset of
rows or columns, usually a contiguous range, using indexing techniques.

In this section, you will see the different techniques of indexing and slicing
Pandas dataframes.

You will be using the Titanic dataset for this section, which you can import
via the Seaborn library’s load_dataset() method, as shown in the script below.

Script 1:

import matplotlib.pyplot as plt
import seaborn as sns

# sets the default style for plotting
sns.set_style("darkgrid")

# loads the Titanic dataset that ships with Seaborn
titanic_data = sns.load_dataset('titanic')
titanic_data.head()

Output:

3.1.1. Selecting Data Using Brackets []

One of the simplest ways to select data from various columns is by using
square brackets. To get column data in the form of a series from a Pandas
dataframe, you need to pass the column name inside square brackets that
follow the Pandas dataframe name.

The following script selects records from the class column of the Titanic
dataset.

Script 2:

print(titanic_data["class"])
type(titanic_data["class"])

Output:

0 Third
1 First
2 Third
3 First
4 Third

886 Second
887 First
888 Third
889 First
890 Third
Name: class, Length: 891, dtype: category
Categories (3, object): ['First', 'Second', 'Third']
Out[2]:
pandas.core.series.Series
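
The difference between single and double square brackets can be sketched as follows. This is a minimal illustration that assumes a small hand-made dataframe (the people variable and its values are hypothetical stand-ins, not the Titanic dataset).

```python
import pandas as pd

# Hypothetical stand-in data; not the Titanic dataset.
people = pd.DataFrame({
    "class": ["Third", "First", "Second"],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
})

one_column = people["class"]       # a column name -> Series
one_column_df = people[["class"]]  # a list with one name -> DataFrame

print(type(one_column).__name__)     # Series
print(type(one_column_df).__name__)  # DataFrame
print(one_column_df.shape)           # (3, 1)
```

A list inside the brackets always yields a dataframe, even when the list holds a single column name.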

You can select multiple columns by passing a list of column names (as strings)
inside the square brackets. You will then get a Pandas dataframe with the
specified columns, as shown below.

Script 3:

print(type(titanic_data[["class", "sex", "age"]]))
titanic_data[["class", "sex", "age"]]

Output:

You can also filter rows based on column values. To do this, you pass a
condition inside the square brackets. For instance, the script below returns
all records from the Titanic dataset where the sex column contains the value
“male.”

Script 4:
my_df = titanic_data[titanic_data["sex"] == "male"]
my_df.head()

Output:

You can specify multiple conditions inside the square brackets. The following
script returns those records where the sex column contains the string “male,”
while the class column contains the string “First.”

Script 5:

my_df = titanic_data[(titanic_data["sex"] == "male") & (titanic_data["class"] == "First")]
my_df.head()

Output:

You can also use the isin() function to filter records against a list of
allowed values. For instance, the script below keeps all records where the age
column contains the value 20, 21, or 22.

Script 6:

ages = [20, 21, 22]
age_dataset = titanic_data[titanic_data["age"].isin(ages)]
age_dataset.head()

Output:

3.1.2. Indexing and Slicing Using loc Function

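The loc function of a Pandas dataframe (strictly speaking, an indexer that is
used with square brackets rather than parentheses) selects rows and columns by
their labels and can also be used to filter records.

To create a dummy dataframe used as an example in this section, run the
following script: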

Script 7:

import pandas as pd

scores = [
{'Subject':'Mathematics', 'Score': 85 , 'Grade': 'B', 'Remarks': 'Good', },
{'Subject':'History', 'Score': 98 , 'Grade': 'A','Remarks': 'Excellent'},
{'Subject':'English', 'Score': 76 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Science', 'Score': 72 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Arts', 'Score': 95 , 'Grade': 'A','Remarks': 'Excellent'},
]

my_df = pd.DataFrame(scores)
my_df.head()

Output:
Let’s now see how to filter records. To select the row with index label 2
(the third row) in the my_df dataframe, you need to pass 2 inside the square
brackets that follow the loc function. Here is an example:

Script 8:

print(my_df.loc[2])
type(my_df.loc[2])

In the output below, you can see data from the row with index label 2 (the
third row) in the form of a series.

Output:

Subject English
Score 76
Grade C
Remarks Fair
Name: 2, dtype: object
Out[7]:
pandas.core.series.Series

You can also specify a range of index labels to filter records using the loc
function. Unlike ordinary Python slices, a loc slice includes both endpoints.
For instance, the following script returns the records with index labels 2, 3,
and 4.

Script 9:

my_df.loc[2:4]

Output:

Along with filtering rows, you can also specify which columns to filter with
the loc function.

The following script filters the values in columns Grade and Score in the
rows from index 2 to 4.

Script 10:

my_df.loc[2:4, ["Grade", "Score"]]

Output:

In addition to passing default integer indexes, you can also pass named or
labeled indexes to the loc function.

Let’s create a dataframe with named indexes. Run the following script to do
so:

Script 11:
import pandas as pd

scores = [
{'Subject':'Mathematics', 'Score': 85 , 'Grade': 'B', 'Remarks': 'Good', },
{'Subject':'History', 'Score': 98 , 'Grade': 'A','Remarks': 'Excellent'},
{'Subject':'English', 'Score': 76 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Science', 'Score': 72 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Arts', 'Score': 95 , 'Grade': 'A','Remarks': 'Excellent'},
]

my_df = pd.DataFrame(scores, index = ["Student1", "Student2", "Student3", "Student4", "Student5"])
my_df

From the output below, you can see that the my_df dataframe now contains
named indexes, e.g., Student1, Student2, etc.

Output:

Let’s now filter a record using Student1 as the index value in the loc function.

Script 12:

my_df.loc["Student1"]

Output:

Subject Mathematics
Score 85
Grade B
Remarks Good
Name: Student1, dtype: object

As shown below, you can pass multiple named indexes in a list to the loc
function. The script below filters records with indexes Student1 and Student2.

Script 13:

index_list = ["Student1", "Student2"]
my_df.loc[index_list]

Output:

You can also find the value in a particular column while filtering records
using a named index.

The script below returns the value in the Grade column for the record with the
named index Student1 .

Script 14:

my_df.loc["Student1", "Grade"]

Output:

'B'

As you did with the default integer index, you can specify a range of records
using the named indexes within the loc function.

The following script returns values in the Grade column for the indexes
from Student1 to Student2.

Script 15:

my_df.loc["Student1":"Student2", "Grade"]

Output:

Student1 B
Student2 A
Name: Grade, dtype: object

Let’s see another example.

The following script returns values in the Grade column for the indexes
from Student1 to Student4.

Script 16:

my_df.loc["Student1":"Student4", "Grade"]

Output:

Student1 B
Student2 A
Student3 C
Student4 C
Name: Grade, dtype: object
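
The two outputs above show that a label slice with loc includes its end label. The sketch below restates this on a small hypothetical dataframe (the grades variable is an assumption made for illustration) and contrasts it with an ordinary Python list slice, which excludes its end position.

```python
import pandas as pd

# Hypothetical data mirroring the Grade column used in this section.
grades = pd.DataFrame(
    {"Grade": ["B", "A", "C", "C", "A"]},
    index=["Student1", "Student2", "Student3", "Student4", "Student5"],
)

# Label slice: both "Student2" and "Student4" are included.
label_slice = grades.loc["Student2":"Student4", "Grade"]
print(list(label_slice.index))  # ['Student2', 'Student3', 'Student4']

# A plain Python list slice stops before the end position.
names = list(grades.index)
print(names[1:3])  # ['Student2', 'Student3']
```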

You can also pass a list of Boolean values to the loc function, with one value
per row; only the rows whose value is True are selected.

For instance, the following script returns only the fourth record since all the
values in the list passed to the loc function are False, except the one at the
fourth position (index 3).

Script 17:

my_df.loc[[False, False, False, True, False]]

Output:

You can also pass dataframe conditions inside the loc function. A condition
evaluates to a series of Boolean values, one per row, which loc uses in the
same way as the Boolean list in the previous script.

Before you see how the loc function uses conditions, let’s see the outcome of a
basic condition in a Pandas dataframe. The script below returns index names
along with True or False values depending on whether the Score column
contains a value greater than 80 or not.

Script 18:

my_df["Score"] > 80

You can see Boolean values in the output. You can see that indexes Student1,
Student2, and Student5 contain True.

Output:

Student1 True
Student2 True
Student3 False
Student4 False
Student5 True
Name: Score, dtype: bool
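
Boolean series like the one above can be combined before they are used for selection. The following is a minimal sketch on a hypothetical copy of the scores data; it assumes nothing beyond standard pandas operators. Note that pandas conditions are combined with &, |, and ~ rather than the Python keywords and, or, and not, and each condition needs its own parentheses when written inline.

```python
import pandas as pd

# Hypothetical copy of the scores data used in this section.
scores = pd.DataFrame(
    {
        "Score": [85, 98, 76, 72, 95],
        "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
    },
    index=["Student1", "Student2", "Student3", "Student4", "Student5"],
)

high = scores["Score"] > 80                   # Boolean series, one value per row
excellent = scores["Remarks"] == "Excellent"

both = high & excellent    # element-wise AND
either = high | excellent  # element-wise OR
not_high = ~high           # element-wise NOT

print(list(scores[both].index))      # ['Student2', 'Student5']
print(list(scores[either].index))    # ['Student1', 'Student2', 'Student5']
print(list(scores[not_high].index))  # ['Student3', 'Student4']
```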

Now, let’s pass the condition my_df["Score"] > 80 to the loc function.

Script 19:

my_df.loc[my_df["Score"] > 80]

In the output, you can see records with the indexes Student1, Student2, and
Student5.

Output:

You can pass multiple conditions to the loc function. For instance, the script
below returns those rows where the Score column contains a value greater
than 80, and the Remarks column contains the string Excellent.

Script 20:

my_df.loc[(my_df["Score"] > 80) & (my_df["Remarks"] == "Excellent")]

Output:

You can also specify column names to fetch values from, along with a
condition.

For example, the script below returns values from the Score and Grade
columns, where the Score column contains a value greater than 80.

Script 21:

my_df.loc[my_df["Score"] > 80, ["Score", "Grade"]]

Output:

Finally, you can set values for all the columns in a row using the loc function.
For instance, the following script sets every column of the record at index
Student4 to 90, including the columns that previously held strings.

Script 22:

my_df.loc["Student4"] = 90
my_df

Output:

3.1.3. Indexing and Slicing Using iloc Function


You can also use the iloc function for selecting and slicing records. However,
unlike the loc function, which selects by index label (whether the labels are
strings or integers), the iloc function selects only by integer position,
counted from 0.
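
The difference between label-based and position-based selection is easiest to see when the index labels are integers that do not start at 0. The sketch below assumes a hypothetical three-row dataframe with labels 10, 20, and 30.

```python
import pandas as pd

# Hypothetical dataframe whose integer labels differ from the row positions.
df = pd.DataFrame(
    {"Subject": ["Mathematics", "History", "English"]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "Subject"]  # label 20 is the second row
by_position = df.iloc[0, 0]       # position 0 is the first row

print(by_label)     # History
print(by_position)  # Mathematics

# There is no label 0 and no position 20, so these lookups fail.
try:
    df.loc[0]
except KeyError:
    loc_error = "KeyError"

try:
    df.iloc[20]
except IndexError:
    iloc_error = "IndexError"

print(loc_error, iloc_error)  # KeyError IndexError
```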

The following script creates a dummy dataframe for this section.

Script 23:

import pandas as pd

scores = [
{'Subject':'Mathematics', 'Score': 85 , 'Grade': 'B', 'Remarks': 'Good', },
{'Subject':'History', 'Score': 98 , 'Grade': 'A','Remarks': 'Excellent'},
{'Subject':'English', 'Score': 76 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Science', 'Score': 72 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Arts', 'Score': 95 , 'Grade': 'A','Remarks': 'Excellent'},
]
my_df = pd.DataFrame(scores)
my_df.head()

Output:

Let’s filter the record at index 3 (row 4).

Script 24:

my_df.iloc[3]

The script above returns a series.

Output:

Subject Science
Score 72
Grade C
Remarks Fair
Name: 3, dtype: object

If you want to select a single row as a dataframe rather than as a series, you
need to put the index inside its own square brackets (a one-element list) and
place that list inside the square brackets that follow the iloc function, as
shown below.

Script 25:

my_df.iloc[[3]]

Output:

You can pass multiple indexes to the iloc function to select multiple records.
Here is an example:

Script 26:

my_df.iloc[[2, 3]]

Output:

You can also pass a range of indexes. In this case, the records from the lower
bound up to, but not including, the upper bound are selected.

For instance, the script below returns records from index 2 to index 3 (1 less
than 4).

Script 27:

my_df.iloc[2:4]

Output:

In addition to specifying indexes, you can also pass column numbers (starting
from 0) to the iloc method.

The following script returns values from columns number 0 and 1 for the
records at indexes 2 and 3.

Script 28:

my_df.iloc[[2, 3], [0, 1]]

Output:

You can also pass a range of indexes and columns to select. The script below
selects columns 0 and 1 and rows 2 and 3, since the upper bound of each range
is excluded.

Script 29:

my_df.iloc[2:4, 0:2]

Output:

3.2. Dropping Rows and Columns with the drop() Method


Apart from selecting columns using the loc and iloc functions, you can also
use the drop() method to drop unwanted rows and columns from your
dataframe while keeping the rest of the rows and columns.
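
As a quick preview, drop() also accepts the index and columns keyword arguments, which name the axis explicitly. The sketch below assumes a hypothetical reduced version of the scores data and removes two rows and one column in a single call.

```python
import pandas as pd

# Hypothetical reduced version of the scores data.
scores = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
})

# Remove the rows labeled 1 and 4 and the Grade column in one call.
trimmed = scores.drop(index=[1, 4], columns=["Grade"])

print(list(trimmed.index))    # [0, 2, 3]
print(list(trimmed.columns))  # ['Subject', 'Score']
print(scores.shape)           # (5, 3), the original is unchanged
```

Because drop() returns a new dataframe by default, the original scores dataframe keeps all of its rows and columns.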

3.2.1. Dropping Rows

The following script creates a dummy dataframe that you will use in this
section.

Script 30:

import pandas as pd

scores = [
{'Subject':'Mathematics', 'Score': 85 , 'Grade': 'B', 'Remarks': 'Good', },
{'Subject':'History', 'Score': 98 , 'Grade': 'A','Remarks': 'Excellent'},
{'Subject':'English', 'Score': 76 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Science', 'Score': 72 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Arts', 'Score': 95 , 'Grade': 'A','Remarks': 'Excellent'},
]

my_df = pd.DataFrame(scores)
my_df.head()

Output:

The following script drops records at indexes 1 and 4.

Script 31:

my_df2 = my_df.drop([1, 4])
my_df2.head()

Output:

From the output above, you can see that the indexes are not in sequence since
you have dropped indexes 1 and 4.

You can reset dataframe indexes so that they start from 0 again, using the
reset_index() method.

Let’s call the reset_index() method on the my_df2 dataframe. Here, the value
True for the inplace parameter specifies that you want to reset the index
in place without assigning the result to any new variable.

Script 32:

my_df2.reset_index(inplace=True)
my_df2.head()

Output:

The above output shows that the indexes have been reset. Also, you can see
that a new column named index has been added, which contains the original
index. If you only want to reset the index without creating this new column,
you can do so by passing True as the value for the drop parameter of
the reset_index method.
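
The effect of the drop parameter can be sketched on a small hypothetical dataframe. The example below uses the non-in-place form of reset_index(), which returns a new dataframe, so the two results can be compared side by side.

```python
import pandas as pd

# Hypothetical single-column dataframe with the default index 0..4.
df = pd.DataFrame({"Score": [85, 98, 76, 72, 95]})
kept = df.drop([1, 4])  # remaining index labels: 0, 2, 3

with_old = kept.reset_index()              # old index kept as a column
without_old = kept.reset_index(drop=True)  # old index discarded

print(list(with_old.columns))     # ['index', 'Score']
print(list(with_old["index"]))    # [0, 2, 3]
print(list(without_old.columns))  # ['Score']
print(list(without_old.index))    # [0, 1, 2]
```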

Let’s again drop some rows and reset the index using the reset_index() method
by passing True as the value for the drop parameter. See the following two
scripts:

Script 33:

my_df2 = my_df.drop([1, 4])
my_df2.head()

Output:

Script 34:

my_df2.reset_index(inplace=True, drop=True)
my_df2.head()

Output:

By default, the drop() method doesn’t drop rows in place. Instead, it returns a
new dataframe with the rows removed, which you have to assign to a variable.

For instance, if you drop the records at indexes 1, 3, and 4 using the following
script and then display the first rows of the dataframe, you will see that the
rows are not removed from the original dataframe.

Script 35:

my_df.drop([1, 3, 4])
my_df.head()

Output:

If you want to drop rows in place, you need to pass True as the value for the
inplace parameter, as shown in the script below:

Script 36:

my_df.drop([1, 3, 4], inplace=True)
my_df.head()

Output:

3.2.2. Dropping Columns

You can also drop columns using the drop() method.

The following script creates a dummy dataframe for this section.

Script 37:

import pandas as pd

scores = [
{'Subject':'Mathematics', 'Score': 85 , 'Grade': 'B', 'Remarks': 'Good', },
{'Subject':'History', 'Score': 98 , 'Grade': 'A','Remarks': 'Excellent'},
{'Subject':'English', 'Score': 76 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Science', 'Score': 72 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Arts', 'Score': 95 , 'Grade': 'A','Remarks': 'Excellent'},
]

my_df = pd.DataFrame(scores)
my_df.head()

Output:

To drop columns via the drop() method, you need to pass the list of columns
to the drop() method, along with 1 as the value for the axis parameter of the
drop method.

The following script drops the columns Subject and Grade from our dummy
dataframe.

Script 38:

my_df2 = my_df.drop(["Subject", "Grade"], axis=1)
my_df2.head()

Output:

You can also drop the columns in place from a dataframe using the inplace =
True parameter value, as shown in the script below.

Script 39:

my_df.drop(["Subject", "Grade"], axis=1, inplace=True)
my_df.head()

Output:

3.3. Filtering Rows and Columns with Filter Method


The drop() method drops the unwanted records, and the filter() method
performs the reverse task: it keeps only the rows or columns whose labels you
name. Note that filter() works on index and column labels, not on the values
stored in the dataframe.
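
Besides an explicit list of labels, filter() can also match labels by substring or by regular expression through its like and regex parameters (only one of items, like, and regex may be given per call). The sketch below assumes a hypothetical two-row dataframe with the same column names as the examples in this chapter.

```python
import pandas as pd

# Hypothetical two-row dataframe with the chapter's column names.
df = pd.DataFrame({
    "Subject": ["Mathematics", "History"],
    "Score": [85, 98],
    "Grade": ["B", "A"],
    "Remarks": ["Good", "Excellent"],
})

by_items = df.filter(items=["Score", "Grade"])  # exact column labels
by_like = df.filter(like="Re", axis=1)          # labels containing "Re"
by_regex = df.filter(regex="^S", axis=1)        # labels starting with "S"
row_subset = df.filter(items=[1], axis=0)       # row labels instead of columns

print(list(by_items.columns))  # ['Score', 'Grade']
print(list(by_like.columns))   # ['Remarks']
print(list(by_regex.columns))  # ['Subject', 'Score']
print(list(row_subset.index))  # [1]
```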

3.3.1. Filtering Rows

Run the following script to create a dummy dataframe for this section.

Script 40:

import pandas as pd

scores = [
{'Subject':'Mathematics', 'Score': 85 , 'Grade': 'B', 'Remarks': 'Good', },
{'Subject':'History', 'Score': 98 , 'Grade': 'A','Remarks': 'Excellent'},
{'Subject':'English', 'Score': 76 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Science', 'Score': 72 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Arts', 'Score': 95 , 'Grade': 'A','Remarks': 'Excellent'},
]

my_df = pd.DataFrame(scores)
my_df.head()

Output:

To filter rows using the filter() method, you need to pass the list of row
index labels that you want to keep to the filter() method of the Pandas
dataframe. Along with that, you need to pass 0 as the value for the axis
parameter. Here is an example. The script below keeps only the rows with
indexes 1, 3, and 4 from the Pandas dataframe.

Script 41:

my_df2 = my_df.filter([1, 3, 4], axis=0)
my_df2.head()

Output:

You can also reset indexes after filtering data using the reset_index() method,
as shown in the following script:

Script 42:

my_df2 = my_df2.reset_index(drop=True)
my_df2.head()

Output:

3.3.2. Filtering Columns

The dummy dataframe for this section is created using the following script:

Script 43:

import pandas as pd

scores = [
{'Subject':'Mathematics', 'Score': 85 , 'Grade': 'B', 'Remarks': 'Good', },
{'Subject':'History', 'Score': 98 , 'Grade': 'A','Remarks': 'Excellent'},
{'Subject':'English', 'Score': 76 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Science', 'Score': 72 , 'Grade': 'C','Remarks': 'Fair'},
{'Subject':'Arts', 'Score': 95 , 'Grade': 'A','Remarks': 'Excellent'},
]

my_df = pd.DataFrame(scores)
my_df.head()

Output:

To filter columns using the filter() method, you need to pass the list of column
names to the filter() method. Furthermore, you need to set 1 as the value for the
axis parameter.

The script below filters the Score and Grade columns from your dummy
dataframe.

Script 44:

my_df2 = my_df.filter(["Score", "Grade"], axis=1)
my_df2.head()

Output:

3.4. Sorting Dataframes


You can also sort records in your Pandas dataframe based on values in a
particular column. Let’s see how to do this.

For this section, you will be using the Titanic dataset, which you can load
from the Seaborn library with the following script:

Script 45:

import matplotlib.pyplot as plt
import seaborn as sns

# sets the default style for plotting
sns.set_style("darkgrid")

titanic_data = sns.load_dataset('titanic')
titanic_data.head()

Output:

To sort the Pandas dataframe, you can use the sort_values() method of the
Pandas dataframe. The list of columns used for sorting needs to be passed to
the by parameter of the sort_values() method.

The following script sorts the Titanic dataset in ascending order of the
passenger’s age.

Script 46:

age_sorted_data = titanic_data.sort_values(by=['age'])
age_sorted_data.head()

Output:

To sort in descending order, you need to pass False as the value for the
ascending parameter of the sort_values() method.

The following script sorts the dataset by descending order of age.

Script 47:

age_sorted_data = titanic_data.sort_values(by=['age'], ascending=False)
age_sorted_data.head()

Output:

You can also pass multiple columns to the by attribute of the sort_values()
function. In such a case, the dataset will be sorted by the first column, and in
the case of equal values for two or more records, the dataset will be sorted by
the second column and so on.
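
The tie-breaking behavior can be sketched on a tiny hypothetical dataframe. The example also assumes two standard sort_values() options that this chapter does not otherwise cover: a list of Boolean values for ascending (one per sort column) and the na_position parameter, which controls where missing values are placed.

```python
import pandas as pd

# Hypothetical passengers; two share the same age and one age is missing.
df = pd.DataFrame({
    "age": [30.0, None, 22.0, 30.0],
    "fare": [10.0, 5.0, 7.5, 50.0],
})

# age ascending; ties on age are broken by fare descending.
ordered = df.sort_values(by=["age", "fare"], ascending=[True, False])
print(list(ordered.index))  # [2, 3, 0, 1]

# Missing ages go last by default; na_position="first" moves them up.
na_first = df.sort_values(
    by=["age", "fare"], ascending=[True, False], na_position="first"
)
print(list(na_first.index))  # [1, 2, 3, 0]
```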

The following script first sorts the data by age and then by fare, both in
descending order.

Script 48:

age_sorted_data = titanic_data.sort_values(by=['age', 'fare'], ascending=False)
age_sorted_data.head()

Output:

3.5. Pandas Unique and Count Functions


In this section, you will see how you can get a list of unique values, the
number of all unique values, and records per unique value from a column in a
Pandas dataframe.

You will be using the Titanic dataset once again, which you can load via the
following script.

Script 49:

import matplotlib.pyplot as plt
import seaborn as sns

# sets the default style for plotting
sns.set_style("darkgrid")

titanic_data = sns.load_dataset('titanic')
titanic_data.head()

Output:

To list all the unique values in a column, you can use the unique() function.
The script below returns all the unique values from the class column of the
Titanic dataset.

Script 50:

titanic_data["class"].unique()

Output:

['Third', 'First', 'Second']
Categories (3, object): ['Third', 'First', 'Second']

To get the count of unique values, you can use the nunique() method, as shown
in the script below.

Script 51:

titanic_data["class"].nunique()

Output:

3

To get the count of non-null values for all the columns in your dataset, you
may call the count() method on the Pandas dataframe. The following script
prints the count of the total number of non-null values in all the columns of the
Titanic dataset.

Script 52:

titanic_data.count()

Output:

survived 891
pclass 891
sex 891
age 714
sibsp 891
parch 891
fare 891
embarked 889
class 891
who 891
adult_male 891
deck 203
embark_town 889
alive 891
alone 891
dtype: int64
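
Since count() reports only non-null values, the number of missing values in each column is the difference between the row count and the result of count(). The sketch below checks this on a hypothetical four-row dataframe (the column names are borrowed from the Titanic dataset, but the values are made up).

```python
import pandas as pd

# Hypothetical data with missing values in two columns.
df = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

non_null = df.count()     # non-null values per column
missing = df.isna().sum() # missing values per column
total_rows = len(df)

print(non_null.to_dict())  # {'age': 2, 'deck': 1, 'fare': 4}
print(missing.to_dict())   # {'age': 2, 'deck': 3, 'fare': 0}
print(total_rows)          # 4
```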

Finally, if you want to find the number of records for all the unique values in a
dataframe column, you may use the value_counts() function.

The script below returns counts of records for all the unique values in the
class column.

Script 53:

titanic_data["class"].value_counts()

Output:

Third 491
First 216
Second 184
Name: class, dtype: int64
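
Two optional parameters of value_counts() are often useful alongside the plain counts: normalize=True returns proportions instead of counts, and dropna=False also counts missing values. The sketch below assumes a short hypothetical series rather than the Titanic class column.

```python
import pandas as pd

# Hypothetical series with one missing value.
classes = pd.Series(["Third", "First", "Third", None, "Second", "Third"])

counts = classes.value_counts()                   # missing value ignored
with_missing = classes.value_counts(dropna=False) # missing value counted
shares = classes.value_counts(normalize=True)     # proportions of non-null values

print(counts["Third"])         # 3
print(int(counts.sum()))       # 5
print(int(with_missing.sum())) # 6
print(round(shares["Third"], 2))  # 0.6
```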

Further Readings – Pandas Dataframe Manipulation


Check the official documentation here (https://bit.ly/3kguKgb) to learn
more about the Pandas dataframe manipulation functions.

Hands-on Time – Exercises


Now, it is your turn. Follow the instructions in the exercises below to
check your understanding of Pandas dataframe manipulation techniques that
you learned in this chapter. The answers to these questions are given at the
end of the book.

Exercise 3.1
Question 1:

Which function is used to sort Pandas dataframe by a column value?


A. sort_dataframe()
B. sort_rows()
C. sort_values()
D. sort_records()

Question 2:

To filter columns from a Pandas dataframe, you have to pass a list of column
names to one of the following methods:
A. filter()
B. filter_columns()
C. apply_filter()
D. None of the above

Question 3:

To drop the second and fourth rows from a Pandas dataframe named my_df,
you can use the following script:
A. my_df.drop([2,4])
B. my_df.drop([1,3])
C. my_df.delete([2,4])
D. my_df.delete([1,3])
Exercise 3.2
From the Titanic dataset, filter all the records where the fare is greater than 20
and the passenger traveled alone. You can access the Titanic dataset using the
following Seaborn command:

import seaborn as sns

titanic_data = sns.load_dataset('titanic')
