Pandas 3
Once you have loaded data into your Pandas dataframe, you might need to
manipulate it further: filtering certain columns, dropping others, selecting a
subset of rows or columns, sorting the data, finding unique values, and so on.
You are going to study all these operations in this chapter.
You will first see how to select data via indexing and slicing, followed by a
section on how to drop unwanted rows or columns from your data. You will
then study how to filter your desired rows and columns. The chapter
concludes with an explanation of sorting and finding unique values from a
Pandas dataframe.
In this section, you will see the different techniques of indexing and slicing
Pandas dataframes.
You will be using the Titanic dataset for this section, which you can import
via the Seaborn library’s load_dataset() method, as shown in the script below.
Script 1:
import seaborn as sns
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
One of the simplest ways to select data from various columns is by using
square brackets. To get column data in the form of a series from a Pandas
dataframe, you need to pass the column name inside square brackets that
follow the Pandas dataframe name.
The following script selects records from the class column of the Titanic
dataset.
Script 2:
print(titanic_data["class"])
type(titanic_data["class"])
Output:
0 Third
1 First
2 Third
3 First
4 Third
…
886 Second
887 First
888 Third
889 First
890 Third
Name: class, Length: 891, dtype: category
Categories (3, object): ['First', 'Second', 'Third']
Out[2]:
pandas.core.series.Series
You can select multiple columns by passing a list of column names to the
square brackets, which results in two pairs of brackets. You will then get a
Pandas dataframe with the specified columns, as shown below.
Script 3:
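A minimal sketch of such a selection, assuming a small stand-in dataframe
(titanic_sample, a hypothetical name) that borrows a few column names from the
Titanic dataset, and assuming class, sex, and age are the columns of interest:

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic rows; not the real dataset.
titanic_sample = pd.DataFrame({
    "class": ["Third", "First", "Third"],
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.93],
})

# A list of column names inside the square brackets returns a dataframe.
subset = titanic_sample[["class", "sex", "age"]]
print(type(subset))
print(subset)
```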
Output:
You can also filter rows based on some column values. To do this, you need to
pass the filter condition inside the square brackets. For instance, the script
below returns all records from the Titanic dataset where the sex column
contains the value “male.”
Script 4:
my_df = titanic_data[titanic_data["sex"] == "male"]
my_df.head()
Output:
You can specify multiple conditions inside the square brackets by wrapping
each condition in parentheses and joining them with the & operator. The
following script returns those records where the sex column contains the
string “male,” while the class column contains the string “First.”
Script 5:
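A minimal sketch of a two-condition filter, assuming a small stand-in
dataframe (titanic_sample, a hypothetical name) instead of the full Titanic
dataset:

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic rows; not the real dataset.
titanic_sample = pd.DataFrame({
    "sex": ["male", "female", "male", "male"],
    "class": ["Third", "First", "First", "Second"],
    "age": [22.0, 38.0, 54.0, 35.0],
})

# Each condition sits in its own parentheses; & is the element-wise AND.
first_class_males = titanic_sample[
    (titanic_sample["sex"] == "male") & (titanic_sample["class"] == "First")
]
print(first_class_males)
```

The parentheses matter because & binds more tightly than == in Python.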
Output:
You can also use the isin() function to specify a set of values to filter
records. For instance, the script below filters all records where the age
column contains the values 20, 21, or 22.
Script 6:
ages = [20, 21, 22]
age_dataset = titanic_data[titanic_data["age"].isin(ages)]
age_dataset.head()
Output:
The loc function of the Pandas dataframe can also be used to filter records
by index label. To see how it works, first create a dummy dataframe using the
following script.
Script 7:
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]
my_df = pd.DataFrame(scores)
my_df.head()
Output:
Let’s now see how to filter records. To filter the row at the second index in
the my_df dataframe, you need to pass 2 inside the square brackets that follow
the loc function. Here is an example:
Script 8:
print(my_df.loc[2])
type(my_df.loc[2])
In the output below, you can see data from the row at the second index (row 3)
in the form of a series.
Output:
Subject English
Score 76
Grade C
Remarks Fair
Name: 2, dtype: object
Out[7]:
pandas.core.series.Series
You can also specify the range of indexes to filter records using the loc
function. Unlike standard Python slicing, a loc range includes both endpoints.
For instance, the following script filters records from index 2 to 4,
inclusive.
Script 9:
my_df.loc[2:4]
Output:
Along with filtering rows, you can also specify which columns to filter with
the loc function.
The following script filters the values in columns Grade and Score in the
rows from index 2 to 4.
Script 10:
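A minimal sketch of selecting rows and columns together with loc, assuming a
stand-in dataframe (demo_df, a hypothetical name) with the same contents as
the dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents as the dummy dataframe (default integer index).
demo_df = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
    "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
})

# Rows come first, columns second; the row range 2:4 includes index 4.
grade_score = demo_df.loc[2:4, ["Grade", "Score"]]
print(grade_score)
```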
Output:
In addition to passing default integer indexes, you can also pass named or
labeled indexes to the loc function.
Let’s create a dataframe with named indexes. Run the following script to do
so:
Script 11:
import pandas as pd
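scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]
# Pass the index labels explicitly to get named indexes.
student_ids = ['Student1', 'Student2', 'Student3', 'Student4', 'Student5']
my_df = pd.DataFrame(scores, index=student_ids)
my_df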
From the output below, you can see that the my_df dataframe now contains
named indexes, e.g., Student1, Student2, etc.
Output:
Let’s now filter a record using Student1 as the index value in the loc function.
Script 12:
my_df.loc["Student1"]
Output:
Subject Mathematics
Score 85
Grade B
Remarks Good
Name: Student1, dtype: object
As shown below, you can specify multiple named indexes in a list to the loc
method. The script below filters records with indexes Student1 and Student2.
Script 13:
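A minimal sketch of selecting several named indexes, assuming a stand-in
dataframe (demo_df, a hypothetical name) with the same contents and named
indexes as the dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents and named indexes as the dummy dataframe.
demo_df = pd.DataFrame(
    {
        "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
        "Score": [85, 98, 76, 72, 95],
        "Grade": ["B", "A", "C", "C", "A"],
        "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
    },
    index=["Student1", "Student2", "Student3", "Student4", "Student5"],
)

# A list of labels inside loc returns a dataframe with just those rows.
first_two = demo_df.loc[["Student1", "Student2"]]
print(first_two)
```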
Output:
You can also find the value in a particular column while filtering records
using a named index.
The script below returns the value in the Grade column for the record with the
named index Student1.
Script 14:
my_df.loc["Student1", "Grade"]
Output:
'B'
As you did with the default integer index, you can specify a range of records
using the named indexes within the loc function.
The following script returns values in the Grade column for the indexes
from Student1 to Student2.
Script 15:
my_df.loc["Student1":"Student2", "Grade"]
Output:
Student1 B
Student2 A
Name: Grade, dtype: object
The following script returns values in the Grade column for the indexes
from Student1 to Student4.
Script 16:
my_df.loc["Student1":"Student4", "Grade"]
Output:
Student1 B
Student2 A
Student3 C
Student4 C
Name: Grade, dtype: object
You can also specify a list of Boolean values that correspond to the indexes to
select using the loc method.
For instance, the following script returns only the fourth record since all the
values in the list passed to the loc function are false, except the one at the
fourth index.
Script 17:
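A minimal sketch of Boolean-list selection, assuming that “the fourth record”
means Student4 (position 3 when counting from 0) and using a stand-in
dataframe (demo_df, a hypothetical name):

```python
import pandas as pd

# Stand-in with the same contents and named indexes as the dummy dataframe.
demo_df = pd.DataFrame(
    {
        "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
        "Score": [85, 98, 76, 72, 95],
        "Grade": ["B", "A", "C", "C", "A"],
        "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
    },
    index=["Student1", "Student2", "Student3", "Student4", "Student5"],
)

# One Boolean per row; only rows marked True are returned.
mask = [False, False, False, True, False]
selected = demo_df.loc[mask]
print(selected)
```

The list must contain exactly one value per row of the dataframe.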
Output:
You can also pass dataframe conditions inside the loc method. A condition
returns a series of Boolean values, which can be used to index the loc
function in the same way as the list of Boolean values described above.
Before you see how the loc function uses conditions, let’s see the outcome of
a basic condition in a Pandas dataframe. The script below returns index names
along with True or False values depending on whether the Score column
contains a value greater than 80 or not.
Script 18:
my_df["Score"] > 80
You can see Boolean values in the output. You can see that indexes Student1,
Student2, and Student5 contain True.
Output:
Student1 True
Student2 True
Student3 False
Student4 False
Student5 True
Name: Score, dtype: bool
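Let’s now pass this condition to the loc function to filter the records where
the Score column contains a value greater than 80.
Script 19:
my_df.loc[my_df["Score"] > 80]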
In the output, you can see records with the indexes Student1, Student2, and
Student5.
Output:
You can pass multiple conditions to the loc function. For instance, the script
below returns those rows where the Score column contains a value greater
than 80, and the Remarks column contains the string Excellent.
Script 20:
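A minimal sketch of combining two conditions inside loc, assuming a stand-in
dataframe (demo_df, a hypothetical name) with the same contents and named
indexes as the dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents and named indexes as the dummy dataframe.
demo_df = pd.DataFrame(
    {
        "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
        "Score": [85, 98, 76, 72, 95],
        "Grade": ["B", "A", "C", "C", "A"],
        "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
    },
    index=["Student1", "Student2", "Student3", "Student4", "Student5"],
)

# Both conditions must hold; each one is wrapped in parentheses.
high_excellent = demo_df.loc[
    (demo_df["Score"] > 80) & (demo_df["Remarks"] == "Excellent")
]
print(high_excellent)
```

Student1 scores above 80 but is left out because its Remarks value is Good.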
Output:
You can also specify column names to fetch values from, along with a
condition.
For example, the script below returns values from the Score and Grade
columns, where the Score column contains a value greater than 80.
Script 21:
my_df.loc[my_df["Score"] > 80, ["Score", "Grade"]]
Output:
Finally, you can set values for all the columns in a row using the loc function.
For instance, the following script sets the value of every column in the
record at index Student4 to 90.
Script 22:
my_df.loc["Student4"] = 90
my_df
Output:
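The iloc function works like loc but selects records by integer position
instead of by label. The following script recreates the dummy dataframe with
the default integer index.
Script 23: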
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]
my_df = pd.DataFrame(scores)
my_df.head()
Output:
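To select the record at a particular position, pass that position inside the
square brackets that follow iloc.
Script 24:
my_df.iloc[3]
The script above returns a series.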
Output:
Subject Science
Score 72
Grade C
Remarks Fair
Name: 3, dtype: object
If you want to select a single record as a dataframe rather than a series, you
need to wrap the index in square brackets and then pass those square brackets
inside the square brackets that follow the iloc function, as shown below.
Script 25:
my_df.iloc[[3]]
Output:
You can pass multiple indexes to the iloc function to select multiple records.
Here is an example:
Script 26:
my_df.iloc[[2, 3]]
Output:
You can also pass a range of indexes. In this case, the records from the lower
range to 1 less than the upper range will be selected.
For instance, the script below returns records from index 2 to index 3 (1 less
than 4).
Script 27:
my_df.iloc[2:4]
Output:
In addition to specifying indexes, you can also pass column numbers (starting
from 0) to the iloc method.
The following script returns values from columns number 0 and 1 for the
records at indexes 2 and 3.
Script 28:
my_df.iloc[[2, 3], [0, 1]]
Output:
You can also pass a range of indexes and columns to select. As with rows, the
upper bound of a column range is excluded. The script below selects rows 2 and
3 and columns 0 and 1.
Script 29:
my_df.iloc[2:4, 0:2]
Output:
The following script creates a dummy dataframe that you will use in this
section.
Script 30:
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]
my_df = pd.DataFrame(scores)
my_df.head()
Output:
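To drop rows, pass a list of the index labels to remove to the drop() method.
The following script drops the rows at indexes 1 and 4.
Script 31:
my_df2 = my_df.drop([1, 4])
my_df2.head()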
Output:
From the output above, you can see that the indexes are not in sequence since
you have dropped indexes 1 and 4.
You can reset dataframe indexes so that they start from 0 again, using the
reset_index() method. Let’s call the reset_index() method on the my_df2
dataframe. Here, the value True for the inplace parameter specifies that you
want to reset the index in place without assigning the result to any new
variable.
Script 32:
my_df2.reset_index(inplace=True)
my_df2.head()
Output:
The above output shows that the indexes have been reset. Also, you can see
that a new column named index has been added, which contains the original
index. If you only want to reset the index without creating a new column
named index, you can do so by passing True as the value for the drop parameter
of the reset_index() method.
Let’s again drop some rows and reset the index using the reset_index() method
by passing True as the value for the drop parameter. See the following two
scripts:
Script 33:
my_df2 = my_df.drop([1, 4])
my_df2.head()
Output:
Script 34:
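A minimal sketch of resetting the index without keeping the old one, assuming
stand-in dataframes (demo_df and demo_df2, hypothetical names) that mirror the
dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents as the dummy dataframe.
demo_df = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
    "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
})

demo_df2 = demo_df.drop([1, 4])

# drop=True discards the old index instead of storing it in an "index" column.
demo_df2.reset_index(inplace=True, drop=True)
print(demo_df2)
```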
Output:
By default, the drop() method doesn’t drop rows in place. Instead, it returns
a new dataframe without the dropped rows, which you have to assign to a
variable.
For instance, if you drop the records at indexes 1, 3, and 4 using the following
script and then print the dataframe header, you will see that the rows are not
removed from the original dataframe.
Script 35:
my_df.drop([1, 3, 4])
my_df.head()
Output:
If you want to drop rows in place, you need to pass True as the value for the
inplace parameter, as shown in the script below:
Script 36:
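A minimal sketch of an in-place row drop, assuming a stand-in dataframe
(demo_df, a hypothetical name) with the same contents as the dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents as the dummy dataframe.
demo_df = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
    "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
})

# With inplace=True, drop() modifies demo_df itself and returns None.
demo_df.drop([1, 3, 4], inplace=True)
print(demo_df)
```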
Output:
Script 37:
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]
my_df = pd.DataFrame(scores)
my_df.head()
Output:
To drop columns via the drop() method, you need to pass the list of columns
to the drop() method, along with 1 as the value for the axis parameter of the
drop method.
The following script drops the columns Subject and Grade from our dummy
dataframe.
Script 38:
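A minimal sketch of dropping columns, assuming stand-in dataframes (demo_df
and demo_df2, hypothetical names) that mirror the dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents as the dummy dataframe.
demo_df = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
    "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
})

# axis=1 tells drop() that the labels refer to columns, not rows.
demo_df2 = demo_df.drop(["Subject", "Grade"], axis=1)
print(demo_df2)
```

The original demo_df keeps all four columns because the result is assigned to
a new variable.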
Output:
You can also drop the columns in place from a dataframe by passing
inplace=True, as shown in the script below.
Script 39:
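A minimal sketch of an in-place column drop, assuming a stand-in dataframe
(demo_df, a hypothetical name) with the same contents as the dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents as the dummy dataframe.
demo_df = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
    "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
})

# The columns are removed from demo_df itself; nothing is returned.
demo_df.drop(["Subject", "Grade"], axis=1, inplace=True)
print(demo_df)
```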
Output:
Run the following script to create a dummy dataframe for this section.
Script 40:
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]
my_df = pd.DataFrame(scores)
my_df.head()
Output:
To filter rows using the filter() method, you need to pass the list of row
indexes to keep to the filter() method of the Pandas dataframe. Along with
that, you need to pass 0 as the value for the axis parameter of the filter()
method. Here is an example. The script below filters rows with indexes 1, 3,
and 4 from the Pandas dataframe.
Script 41:
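A minimal sketch of row filtering with filter(), assuming stand-in dataframes
(demo_df and demo_df2, hypothetical names) that mirror the dummy dataframe:

```python
import pandas as pd

# Stand-in with the same contents as the dummy dataframe.
demo_df = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
    "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
})

# items lists the index labels to keep; axis=0 applies them to rows.
demo_df2 = demo_df.filter(items=[1, 3, 4], axis=0)
print(demo_df2)
```

Note that filter() matches index labels, not column values.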
Output:
You can also reset indexes after filtering data using the reset_index() method,
as shown in the following script:
Script 42:
my_df2 = my_df2.reset_index(drop=True)
my_df2.head()
Output:
The dummy dataframe for this section is created using the following script:
Script 43:
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85, 'Grade': 'B', 'Remarks': 'Good'},
    {'Subject': 'History', 'Score': 98, 'Grade': 'A', 'Remarks': 'Excellent'},
    {'Subject': 'English', 'Score': 76, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Science', 'Score': 72, 'Grade': 'C', 'Remarks': 'Fair'},
    {'Subject': 'Arts', 'Score': 95, 'Grade': 'A', 'Remarks': 'Excellent'},
]
my_df = pd.DataFrame(scores)
my_df.head()
Output:
To filter columns using the filter() method, you need to pass the list of column
names to the filter() method. Furthermore, you need to set 1 as the value for
the axis parameter.
The script below filters the Score and Grade columns from your dummy
dataframe.
Script 44:
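A minimal sketch of column filtering with filter(), assuming stand-in
dataframes (demo_df and demo_df2, hypothetical names) that mirror the dummy
dataframe:

```python
import pandas as pd

# Stand-in with the same contents as the dummy dataframe.
demo_df = pd.DataFrame({
    "Subject": ["Mathematics", "History", "English", "Science", "Arts"],
    "Score": [85, 98, 76, 72, 95],
    "Grade": ["B", "A", "C", "C", "A"],
    "Remarks": ["Good", "Excellent", "Fair", "Fair", "Excellent"],
})

# items lists the column names to keep; axis=1 applies them to columns.
demo_df2 = demo_df.filter(items=["Score", "Grade"], axis=1)
print(demo_df2)
```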
Output:
For this section, you will be using the Titanic dataset, which you can import
from the Seaborn library using the following script:
Script 45:
import seaborn as sns
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
To sort the Pandas dataframe, you can use the sort_values() function of the
Pandas dataframe. The list of columns used for sorting needs to be passed to
the by parameter of the sort_values() method.
The following script sorts the Titanic dataset in ascending order of the
passenger’s age.
Script 46:
age_sorted_data = titanic_data.sort_values(by=['age'])
age_sorted_data.head()
Output:
To sort in descending order, you need to pass False as the value for the
ascending parameter of the sort_values() function.
Script 47:
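A minimal sketch of a descending sort, assuming a small stand-in dataframe
(titanic_sample, a hypothetical name) instead of the full Titanic dataset:

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic rows; not the real dataset.
titanic_sample = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# ascending=False puts the largest age first.
age_sorted_desc = titanic_sample.sort_values(by=["age"], ascending=False)
print(age_sorted_desc)
```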
Output:
You can also pass multiple columns to the by parameter of the sort_values()
function. In such a case, the dataset will be sorted by the first column, and
in the case of equal values for two or more records, those records will be
sorted by the second column, and so on.
The following script first sorts the data by age and then by fare, both in
descending order.
Script 48:
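A minimal sketch of a two-column descending sort, assuming a small stand-in
dataframe (titanic_sample, a hypothetical name) in which two passengers share
the same age:

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic rows; not the real dataset.
titanic_sample = pd.DataFrame({
    "age": [30.0, 30.0, 25.0, 40.0],
    "fare": [10.0, 50.0, 8.0, 20.0],
})

# Rows are ordered by age first; ties on age are broken by fare.
age_fare_sorted = titanic_sample.sort_values(by=["age", "fare"], ascending=False)
print(age_fare_sorted)
```

The ascending parameter also accepts a list such as [False, True] when each
column needs its own sort direction.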
You will be using the Titanic dataset once again, which you can load via the
following script.
Script 49:
import seaborn as sns
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
To find all the unique values in a column, you can use the unique() function.
The script below returns all the unique values from the class column of the
Titanic dataset.
Script 50:
titanic_data["class"].unique()
Output:
To get the count of unique values, you can use the nunique() method, as shown
in the script below.
Script 51:
titanic_data["class"].nunique()
Output:
To get the count of non-null values for all the columns in your dataset, you
may call the count() method on the Pandas dataframe. The following script
prints the count of the total number of non-null values in all the columns of the
Titanic dataset.
Script 52:
titanic_data.count()
Output:
survived 891
pclass 891
sex 891
age 714
sibsp 891
parch 891
fare 891
embarked 889
class 891
who 891
adult_male 891
deck 203
embark_town 889
alive 891
alone 891
dtype: int64
Finally, if you want to find the number of records for all the unique values in a
dataframe column, you may use the value_counts() function.
The script below returns counts of records for all the unique values in the
class column.
Script 53:
titanic_data["class"].value_counts()
Output:
Third 491
First 216
Second 184
Name: class, dtype: int64
Exercise 3.1
Question 1:
Question 2:
To filter columns from a Pandas dataframe, you have to pass a list of column
names to one of the following methods:
A. filter()
B. filter_columns()
C. apply_filter()
D. None of the above
Question 3:
To drop the second and fourth rows from a Pandas dataframe named my_df,
you can use the following script:
A. my_df.drop([2,4])
B. my_df.drop([1,3])
C. my_df.delete([2,4])
D. my_df.delete([1,3])
Exercise 3.2
From the Titanic dataset, filter all the records where the fare is greater than 20
and the passenger traveled alone. You can access the Titanic dataset using the
following Seaborn command:
titanic_data = sns.load_dataset('titanic')