
Analyzing Data Using Python: Filtering Data in Pandas
Not all data is useful. Luckily, there are some powerful filtering operations available in
pandas. The course begins with a detailed look at how loc and iloc can be used to access
specific data from a DataFrame. You'll move on to filter data using the classic pandas
lookup syntax and the pandas filter and query methods. You'll illustrate how the filter
function accepts wildcards as well as regular expressions and use various methods such
as the .isin method to filter data.

Furthermore, you'll filter data using either two pairs of square brackets - in which case
the resulting subset is itself a DataFrame - or a single pair of square brackets, in which
case the returned data takes the form of a Series. You'll drop rows and columns from a
pandas DataFrame and see how rows can be filtered out of a DataFrame. Lastly, you'll
identify a possible gotcha that arises when you drop rows in-place but neglect to reset
the index labels in your object.


Contents
Analyzing Data Using Python: Filtering Data in Pandas
1. Course Overview
Objectives
Instructor
2. Performing Data Lookup Operations
3. Leveraging the loc and iloc Functions
4. Using the loc, iloc, at, and iat Functions to Filter Data
5. Filtering Data Using Wildcards and Boolean Predicates
6. Using the Query Function to Filter Data
7. Manipulating Data Using Datetime Values
8. Selecting and Dropping Columns
9. Applying Advanced Techniques to Select and Drop Columns
10. Course Summary
11. Test


1. Course Overview
Topic title: Course Overview

Hi, and welcome to this course, Filtering Data in Pandas. My name is Vitthal Srinivasan and I will be your instructor for this course. Your host for this session is Vitthal Srinivasan, a software engineer and big data expert. A little bit about myself first.

I did my master's from Stanford University and have worked at various companies,
including Google and Credit Suisse. I presently work for Loonycorn, a studio for high
quality video content.

Pandas is an extremely popular and powerful Python library used for working with
tabular and time series data. The key abstraction in Pandas is the dataframe object,
which encapsulates data organized into named columns and uniquely identified rows.

This, of course, is exactly how spreadsheets as well as relational databases represent data, and it is also how many data analysts and computer scientists are accustomed to modelling data mentally. This universality in design, coupled with a natural syntax that combines the best elements of Pythonic as well as R-style programming and constantly expanding APIs, all helps explain the meteoric rise in popularity of Pandas over the last decade.

We will begin this course by exploring how complex data filter operations can be performed using the loc and iloc methods. We then move on to filtering data using either the Pandas filter and query methods, which follow the Pythonic programming idiom, or the classic Pandas lookup syntax, which derives heavily from R. We will learn how to drop rows and columns from a Pandas dataframe, and see how rows can be filtered out of a dataframe. Along the way, we will become aware of a possible gotcha that arises when we drop rows in place but then neglect to reset the index labels in our dataframe object.

By the end of this course, you will have a solid grasp of filtering data in Pandas using the loc and iloc methods, and will be aware of the best practices in dropping rows and columns.


Objectives

 Discover the key concepts covered in this course

 Look up data using different techniques

 Apply the loc and iloc functions to access specific rows and columns

 Filter data using the loc, iloc, at, and iat functions

 Filter data using wildcards and Boolean predicates

 Perform conditional filtering using the query function

 Parse and manipulate datetime values

 Select and drop specific columns

 Apply regular expressions and other advanced techniques to select and drop
columns

 Summarize the key concepts covered in this course

Instructor

Vitthal Srinivasan


2. Performing Data Lookup Operations


Topic title: Performing Data Lookup Operations. Your host for this session is Vitthal
Srinivasan.

In this series of demos, we will change our focus from computing basic statistics towards searching, selecting, filtering, and querying data in a pandas data frame. Let's go ahead and get started. A Jupyter notebook is open on the screen. This is a new Python notebook. So we begin once again by importing pandas as pd, that's the alias. Then we read in from Datasets/superstore_dataset.csv.

He enters the following command in the first code cell: import pandas as pd. He enters
a set of commands in code cell 2. The first command line is: superstore_data =
pd.read_csv( 'Datasets/superstore_dataset.csv'). The second command line is:
superstore_data.columns.

We read these contents into a data frame called superstore_data. And then we invoke
the .columns property on this. When we hit Shift+Enter, we get a list of all columns in
our data frame.

Let's begin by cleaning up the column names. This is an operation which we had
previously performed.

He enters a set of commands in code cell 3. The first command line is:
superstore_data.columns = [column.replace(' ', '_' ).upper() for column in
superstore_data.columns]. The second command line is: superstore_data.head().

We are going to redo it, we are making use of list comprehension. So on the right hand
side of the equal to sign we have a list comprehension expression contained between
square brackets. That expression is going to iterate over all of the column names. And
it's going to take each column name, convert it to uppercase. And along the way, also
replace every space with an underscore. The result of this will be to make our column
names very clear.

They'll be in uppercase, and they'll have underscores. You can see what this looks like
when we run the head command. That head command also tells us that we have 24
columns in our data. We can view all of them by scrolling towards the right. This is a nice
feature of Jupyter IPython Notebooks. In any case, let's move on and now run the head
command, but this time only on a specific column.

He enters the following command in code cell 4: superstore_data.REGION.head().


Note how we are making use of the Object notation in order to access the REGION
column. We have the name of the data frame, superstore_data, followed by the dot.
Followed by the exact name of the column, which is REGION, all uppercase. And then
we invoke the .head method on it. We can see that the results are the names of five
regions. However, from the formatting, it seems like this output is not a data frame.
Rather, it's a series.

And indeed, this is true. When we access an individual column from a data frame, using either the object notation which you see on screen now or using a single pair of square brackets, what we get is not a data frame but a data series. Let's confirm this by running the type command on superstore_data.REGION.head(). You can see the output of this on screen now.

He enters the following command in code cell 5: type(superstore_data.REGION.head()).

This is of type pandas.core.series.Series. Let's keep going. Let's run another operation.
This time we are going to filter out the CUSTOMER_NAME column, and then run the
head command on this.

He enters the following command in code cell 6: superstore_data[['CUSTOMER_NAME']].head(10).

But please note how we've specified the CUSTOMER_NAME column.

We have enclosed it in quotes, and we've used two pairs of square brackets before and
after this. This is going to have the effect of returning a data frame rather than a data
series. And we are then invoking the head command on that data frame. What columns
is that data frame going to include?

Well, every column which is in the list which we specified between the double square
brackets. Here, there's just one column in that list, and that's CUSTOMER_NAME. But
again, there's an important difference between indexing or filtering with two sets of
square brackets versus one set of square brackets. When you use two sets of square
brackets, as you see on screen now, the output is going to be a data frame.

We hit Shift+Enter, and we now see that we do indeed have a data frame. We can tell
from the formatting that data frame includes all of the customer names for all rows in
our original data frame. Then by running the head command with the input argument of
10, we are getting the first ten customer names. Please note that these ten customer


names are not unique values. These literally correspond to the customer names of the
first ten rows in the underlying data frame.

We can confirm that for instance, by looking at rows with labels 1, 4, and 5. All of these
have customer name Joseph Holt. So again, this difference between the single and
double pairs of square brackets is an important one. Let's perform another selection
operation. We are again going to specify a list of columns using two pairs of square
brackets.

This time we are going to specify multiple columns within that list.

He enters a set of commands in code cell 7. The first command line is:
customer_details = superstore_data[['CUSTOMER_ID', 'CUSTOMER_NAME', 'CITY',
'STATE', 'COUNTRY']]. The second command line is: customer_details.head().

Here the columns are CUSTOMER_ID, CUSTOMER_NAME, CITY, STATE, and COUNTRY.
This time because we have multiple columns, we have no choice but to use the double
square brackets. We hit Shift+Enter, and we see the output of the head command.

This is clearly a data frame. We have the first five rows of this data frame, and the data
frame only has the columns which we specified.

Next, let's create a list. This list is going to include some column names of interest to us.
So this list is called sales.

He enters the following command in code cell 8: sales = ['CATEGORY', 'SUB-CATEGORY', 'SALES'].

It's a plain old Python list, and it includes the names of three columns, CATEGORY, SUB-
CATEGORY, and SALES.

Once we have this list, let's go ahead and filter our data frame to only display these
three columns. So we've now performed a lookup operation on superstore_data.

He enters a set of commands in code cell 9. The first command line is: product_details
= superstore_data [sales]. The second command line is: product_details.head().

We have one pair of square brackets, inside which we have the variable, sales. The
return value from this is saved in a variable called product_details.

Before we hit Shift+Enter, and before we know what the output looks like, let's try and predict whether the output is going to be a data frame or a data series.


We've discussed so far that anytime we have two pairs of square brackets, we expect
the result to be a data frame.

Here, you might think that we see only the one pair of square brackets. But look more
closely. Sales itself is a list, and sales itself is delimited by a pair of square brackets. So if
we were to splice in the actual value of the sales variable, we are actually filtering using
two pairs of square brackets.

And so the output, that's product_details, is going to be a data frame rather than a data
series. Let's see whether our intuition is correct. We hit Shift+Enter, and indeed it is. The
output is a data frame which only includes the three columns that we have specified.
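
To make the bracket rule concrete, here is a minimal sketch, assuming the superstore_data frame loaded earlier in this notebook:

# Single square brackets (or the dot notation) return a pandas Series.
name_series = superstore_data['CUSTOMER_NAME']
# Double square brackets return a DataFrame, even when the list holds just one column.
name_frame = superstore_data[['CUSTOMER_NAME']]
print(type(name_series))   # <class 'pandas.core.series.Series'>
print(type(name_frame))    # <class 'pandas.core.frame.DataFrame'>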

Next, let's perform a simple operation to tell us how many columns we have of each
dtype. The code for this is visible on screen now. We use the .dtypes property of the
superstore_data data frame.

He enters the following command in code cell 10: superstore_data.dtypes.value_counts().

Then we invoke the .value_counts method on this dtypes property. The result of this is a series that behaves much like a dictionary: each key corresponds to a dtype present in our data frame.

And the value corresponds to how many columns have that particular dtype. So we have
17 columns which have dtype object, 5 which have dtype float64, and 2 which have
dtype int64.

Note also that 17 plus 5 plus 2 gives us 24, and that is the total number of columns in
our data.

Let's make use of these dtypes in order to select only columns which are of a particular
dtype.

He enters a set of commands in code cell 11. The first command line is: selected_data = superstore_data.select_dtypes(include=['float64', 'int64']). The second command line is: selected_data.tail(5).

This is an operation that we might want to do for instance, when we only want to filter
out and include all numeric types.

On screen now you can see that we've made use of a property called select_dtypes. This
is a property which we've invoked on the superstore_data data frame. This property
takes in an input argument. This is the named input argument, include.


Here we've specified a list that is a list of all types, that is of all dtypes which we wish to
include. Those are float64 and int64. The return value from this is also a pandas data
frame. And we can invoke the tail method on it. The output is visible down below.

You can see that we have five rows. These five rows have columns for all of the numeric
data types. That is for all of the int64 and the float64 columns in the original data frame.
We can count the number of columns, and we see that this works out to seven.

Let's confirm this by invoking the shape property. Remember that the shape property of a data frame is going to give the number of rows and columns.

He enters the following command in code cell 12: selected_data.shape.

Here when we invoke the shape property on selected_data, we see that the number of rows is 51,290.

And the number of columns is 7. And we know that this is correct because up top in the output of cell 10, we have 5 columns which are of type float64 and 2 columns which are of type int64. So 5 plus 2 gives us 7, and that matches the number of columns in our selected_data.
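
As a side note, select_dtypes also accepts an exclude argument, which is sometimes the more natural way to express the same filter. A small sketch, assuming the same superstore_data frame:

# Keep only the numeric columns, or equivalently drop the object (string) columns.
numeric_columns = superstore_data.select_dtypes(include=['float64', 'int64'])
non_text_columns = superstore_data.select_dtypes(exclude=['object'])
print(numeric_columns.shape, non_text_columns.shape)   # both report 7 columns for this dataset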

Now if we look really closely at the selected data, we spot a bit of redundancy. We can
see that we have the index column. This is the default set of index values generated by
pandas, 0 through 51,289. But in addition to these autogenerated labels, we also have a
column called ROW_ID. So it's quite clear that the data when we imported it in already
had a ROW_ID column. Why don't we just make use of those ROW_IDs and eliminate
the system generated labels?

Let's see how to do this. On screen now we've used the set_index property of our
superstore_data. The set_index property takes in a column name, that column name is
ROW_ID.

He enters a set of commands in code cell 13. The first command line is: superstore_data.set_index('ROW_ID', inplace=True). The second command line is: superstore_data.head().

The second argument is the famous inplace, and here we've specified that inplace is
equal to True.

This is going to have the effect of modifying the underlying data frame, that's
superstore_data. It's going to get rid of all of the default labels.


That is all of the values 0 through 51,289. And instead of those default labels, it's now
the ROW_ID, which is going to be used as the label column. You can see two other
interesting points.

There's now a name for the label column and that name is ROW_ID. The other
interesting point is that the number of columns has reduced by 1. We previously had 24
columns. But now as you can see from the little note in the bottom left, the number of
columns is 23.

Let's confirm this by re-invoking the shape property on our data frame. So now we run
superstore_data.shape, and when we hit Shift+Enter, we find that the number of rows is
the same, 51,290.

But the number of columns has decreased from 24 to 23. This little example
demonstrating the use of set_index is an important one.

He enters the following command in code cell 14: superstore_data.shape.

When the data that you read in from an external source already has a key column, you should be making use of that key column as the index. You should not be relying on the pandas default index labels of 0 through n-1. So in such situations, using set_index is a good practice, and the example on screen shows you how this is done.
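
An alternative worth knowing: read_csv can adopt an existing key column as the index at load time, so no separate set_index call is needed. A sketch, assuming the raw file names the column 'Row ID' before our renaming step:

import pandas as pd

# index_col makes pandas use the named column as the index labels while reading the file.
# Assumes the raw CSV calls this column 'Row ID'; adjust the name to match your file.
superstore_data = pd.read_csv('Datasets/superstore_dataset.csv', index_col='Row ID')
print(superstore_data.shape)   # the key column no longer counts towards the data columns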


3. Leveraging the loc and iloc Functions


Topic title: Leveraging the loc and iloc Functions. Your host for this session is Vitthal
Srinivasan.

In the previous demo, we saw how the use of single and double pairs of square brackets
can change the return type of various operations on a Pandas dataframe.

Let's continue with that theme. On screen now you can see that we've made use of the iloc property. Recall that iloc is used to access one or more rows of a dataframe based on the index position.

He enters the following command in code cell 15: superstore_data.iloc[0].

Not on the index label itself, but merely the index position. So for instance here, when we use iloc along with an index value of 0, we are querying our dataframe for the first row based on its index position.

Please note that we are using iloc with a single pair of square brackets. And within that
single pair of square brackets, we've enclosed the index position 0. So now when we hit
Shift+Enter, we are going to get the details of the first row.

However, we can plainly tell from the formatting that these details are not in the form
of a dataframe. What we see on screen now is a data series. Every one of the elements
in the data series has a label, and those labels correspond to the column names.

So for instance, the first field has label ORDER_ID, and it has value AG-2011-2040. So here we've made use of one pair of square brackets, and the output is in the form of a Pandas series.

Let's see what happens if we now use two pairs of square brackets instead of one. So we
are now going to have the exact same line of code with one crucial difference.

He enters the following command in code cell 16: superstore_data.iloc [[0]].

We are using iloc but with two pairs of square brackets, so 0 is delimited on either end
by two opening and two closing square brackets. When we hit Shift+Enter, this time,
what we get is a Pandas dataframe with a single row. Row ID is 42433, the little note in
the bottom left tells us that we have 1 row and 23 columns and this is exactly what we
expect.


So once again, we've demonstrated that anytime we access or index into a Pandas dataframe using two pairs of square brackets, the return type is a Pandas dataframe as well. Let's expand on this idea and use the iloc property, but now with a list of index positions instead of just one. We've now used iloc with a list; the index positions we are interested in are 2, 3, and 5.

He enters a set of commands in code cell 17. The first command line is: selected_rows
= superstore_data .iloc [[2, 3, 5]]. The second command line is: selected_rows.

We've saved the return value in a dataframe called selected_rows, and then displayed
that. When we hit Shift+Enter, we find that we do indeed have a dataframe which has 3
rows and 23 columns. We can scroll to the right in order to view the values of all of the
different columns.

Let's confirm that the shape is what we expect it to be. So we invoke the shape property
on selected_rows and indeed, we have a tuple with two elements.

He enters the following command in code cell 18: selected_rows.shape.

The first gives the number of rows, that's 3, the second gives the number of columns,
which is 23.

Let's now turn our attention from iloc, to loc. Recall that the loc property can be used to
access rows in a dataframe based on the actual labels.  

He enters the following command in code cell 19: superstore_data .loc [[42433]].

Labels refer to values on the index column. Here on screen now, we've invoked the loc
property and we've passed in just the one label, that label is 42433. Do remember that
when we had used iloc of 0, the row label that we had gotten back was 42433.

So this operation is equivalent to iloc of 0. Also note that in the syntax on the screen
now we've made use of two pairs of square brackets. And so when we hit Shift+Enter,
we expect to get a Pandas dataframe which has 1 row and 23 columns. And that indeed
is the case. Just like we were able to use the iloc with a list of index positions 2, 3, and 5,
we can do the same with the loc property.

On screen now, we've used loc with a list, that list has three label values. And those
three label values actually correspond to the index positions, 2, 3, and 5.


He enters a set of commands in code cell 20. The first command line is: selected_rows
= superstore_data .loc [[22255, 48883, 11731]]. The second command line is:
selected_rows.

Once again, we've made use of the double square brackets and once again, the return
type is a Pandas dataframe. 3 rows and 23 columns exactly as we expect.

Let's try out a few different forms of syntax with loc and iloc. On screen now, we've used iloc along with a range. This range is 1:5; it is a slice covering several index positions, so the result will span multiple rows and comes back as a dataframe rather than a series.

He enters the following command in code cell 21: superstore_data.iloc [1:5].

And that's why when we execute this using Shift+Enter, what we get back is a Pandas
dataframe. The range 1 through 5 has four values in it, and that's why the result has four
rows.

And we have 23 columns as usual, because that's the number of columns in our data
frame. We can scroll to the right and satisfy ourselves that all of the columns are in
there. Let's now move on and define a range in terms of a start and end index.

He enters a set of commands in code cell 22. The first command line is: start = 13879.
The second command line is: end = 30142.

Here we've defined start as 13879 and end as 30142, then we go ahead and use loc and
within loc we specify the start and end.

He enters the following command in code cell 23: superstore_data.loc [start:end].

Please remember that when we use loc we are actually specifying label values, we are
not specifying label positions. If we now run this code, we are going to get a dataframe
which has four rows. And if we look closely, we will see that the ROW_IDs, that is the
labels for these four rows, are all within the range that we specified. They've been
displayed in increasing order of ROW_ID.

So the smallest is 13879, which is exactly equal to our start, and the largest is 30142,
which is exactly equal to our end. And the two intermediate rows have labels which lie
between the start and the end.
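
To make the contrast explicit, here is a small sketch using the names from above. One caveat worth stating: on an index that is not sorted, a label slice selects the block of rows lying between the positions of the two labels, rather than every label in that numeric range.

# iloc slices by position and excludes the stop value, so this returns four rows.
positional_slice = superstore_data.iloc[1:5]
print(positional_slice.shape[0])   # 4

# loc slices by label and includes both endpoints, provided both labels exist in the index.
label_slice = superstore_data.loc[13879:30142]
print(label_slice.index[0], label_slice.index[-1])   # 13879 ... 30142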

So in this way, we've made use of a range query to look up rows based on their ROW_ID
that is based on their labels. Next, let's turn to an even more interesting and powerful
operation, on screen now we've used iloc.


He enters a set of commands in code cell 24. The first command line is:
customer_data= superstore_data .iloc [:, [0, 4, 5, 16, 17]]. The second command line is:
customer_data.head().

But now within iloc we've specified both row and column filters. You can see that there
is a comma, and before the comma is the colon sign. This is our way of telling Pandas
that we are interested in all rows. What comes after the comma defines the column
filters.

The column filters take the form of a list, and the list values are 0, 4, 5, 16, and 17.

The return value from this iloc operation is of course, a Pandas dataframe and then
we've invoked head on this. Before we run this using Shift+Enter, let's take a moment to
try and predict what the output is going to look like.

Remember that we've asked for all rows and then run the head, so we are going to have 5 rows. How many columns? Well, as many columns as we have list elements, so that's 0, 4, 5, 16, and 17, which gives us five columns.

So we expect five rows and five columns. Let's hit Shift+Enter and this is indeed the case.
We have five rows, we can count the ROW_IDs. And we have the five columns which
correspond to ORDER_ID, CUSTOMER_ID, CUSTOMER_NAME, PRODUCT_NAME and
SALES.

This is the first example where we've used iloc with filters on both rows and columns.
And it's also the first example where we've indexed into the columns using their column
numbers rather than the column names.

Let's verify the shape of this customer_data dataframe. We invoke the shape property
on it, and we can see that it does include 51290 rows and 5 columns.

He enters the following command in code cell 25: customer_data.shape.

So we've got all of the rows in the original dataframe because the row specifier or the
label specifier was simply the colon.

The colon is a way of saying that we are fine with all values of row labels. This iloc
operation you see on screen now is actually a little hard to read. And the reason for this
is that we've included columns based on the column indices rather than the column
names.


But remember that the iloc property always works with indices. And so if we want to
look up columns using iloc, we've got to specify the column indices rather than the
column names. In order to use the column names, we'll have to use loc instead, and
that's exactly what you can see on screen now.

He enters a set of commands in code cell 26. The first command line is:
product_details= superstore_data .loc [:, ['MARKET', 'PRODUCT_NAME', 'SALES',
'PROFIT']]. The second command line is: product_details.head().

We've used loc instead of iloc, and we've used column names rather than column indices.

It's usually a much better idea in terms of code readability and maintainability to look up
columns using the column names. As in the previous example, we also have a row filter,
but that row filter is simply the colon, which means that we accept all rows.

Let's run this and we can see that we indeed get back a dataframe which has the
columns MARKET, PRODUCT_NAME, SALES, and PROFIT. Finally, also note that we've
made use of the two square brackets. The outer square brackets are used to define both
the row and the column filters.

The inner square brackets are used only in order to define the list of column names.
Let's confirm the shape using the shape property, and we can see that this indeed has all
rows, so all 51,290 rows, but just the 4 columns.

He enters the following command in code cell 27: product_details.shape.
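
To recap the two approaches side by side, here is a small sketch; the column positions and names are the ones used above and assume the column order of this dataset:

# iloc needs integer positions for both the row and the column filters...
by_position = superstore_data.iloc[:, [0, 4, 5, 16, 17]]

# ...whereas loc takes row labels and column names, which reads much better.
by_name = superstore_data.loc[:, ['MARKET', 'PRODUCT_NAME', 'SALES', 'PROFIT']]

print(by_position.shape, by_name.shape)   # all 51,290 rows in both, with 5 and 4 columns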


4. Using the loc, iloc, at, and iat Functions to Filter Data
Topic title: Using the loc, iloc, at, and iat Functions to Filter Data. Your host for this
session is Vitthal Srinivasan.

Let's continue building on the use of loc and iloc. A Jupyter notebook is open on the screen. We are now going to use iloc and we are going to specify a list of positions for the rows as well as a list of positions for the columns. The list for the rows is the first list; it has two elements, 2 and 4.

He enters a set of commands in code cell 28. The first command line is: data =
superstore_data .iloc [[2, 4], [5, 16]]. The second command line is: data.

The list for the columns is the second list; it has the elements 5 and 16. Remember here that iloc only takes index positions.

So 2 and 4 are going to give us the rows at index positions 2 and 4, and 5 and 16 are going to give us the columns at index positions 5 and 16.

And those two columns correspond to the customer name and the product name. Let's
now perform a similar operation using loc instead of iloc. Loc tends to be a lot easier to
read and understand. However, while working with loc, we've got to explicitly specify
the row labels and the column names.

We cannot make use of row and column indices. On screen now we've defined two row
labels, these are 32593 and 36388.

He enters a set of commands in code cell 29. The first command line is: row = [32593,
36388]. The second command line is: columns = ['CUSTOMER_NAME',
'PRODUCT_NAME']

And we've also defined two column names, CUSTOMER_NAME and PRODUCT_NAME.

Once we've defined these two lists, we can go ahead and make use of those two lists in
the invocation of the loc property.

He enters a set of commands in code cell 30. The first command line is: data =
superstore_data .loc [row, columns]. The second command line is: data.

So superstore_data.loc, and then we pass in the rows and the columns. We hit
Shift+Enter and examine the output. As expected, we have two rows and two columns.


The two rows have the exact row labels which we passed in, and the two column names
are CUSTOMER_NAME and PRODUCT_NAME.

Let's now see how we can access an individual value from a Pandas dataframe.

He enters the following command in code cell 31: superstore_data .iloc [5, 5].

On screen now we've done this using the iloc property. We've passed in a row index
that's 5, and a column index which is also 5. When we execute this code, we get the one
value. This is a scalar value that's just a string, which is present at row 5 in column 5 and
this is the name Joseph Holt.

Any operation that can be performed using iloc with indices can be performed with loc
using actual labels. So here we've now indexed into the superstore data dataframe.

He enters the following command in code cell 32: superstore_data.loc[31454, 'CUSTOMER_NAME'].

We've passed in a row label that's 31454. And we've passed in a column name which is
CUSTOMER_NAME. When we execute this we get back the exact value.

That's the scalar value present in the row with label 31454 in the column
CUSTOMER_NAME, and that is Dave Brooks. As you can see from these two examples,
it's also possible to accomplish a lot using just the single square brackets.

He enters the following command in code cell 33: superstore_data .iloc [:5].

Let's see how we can use a single square bracket in order to recreate the head function.

On screen now we've used iloc and we passed into iloc a specifier which has a colon.
There's nothing before the colon, but after the colon is the number 5. This is going to
have the effect of returning the first five rows based on the index position. And of
course, that's exactly what the head function does by default.

Notice how we just passed in the one expression into iloc and so that was treated as a
row filter. Let's repeat this experiment. This time we'll pass in two expressions
separated by commas.

He enters the following command in code cell 34: superstore_data .iloc [:5, :].

The first of these is going to be the row filter. The second will be treated as the column
filter.


The row filter is the same as before, so we've asked for the first five rows. The second
expression is simply a colon, so we are asking for all columns. And this again has the
same effect. This once again is going to recreate the output of the head function on a
dataframe. Next, let's turn our attention to how loc and iloc work together.

On screen now, we've made use of a new method.

He enters a set of commands in code cell 35. The first command line is: start_column =
superstore_data.columns.get_loc( 'ORDER_ID'). The second command line is:
end_column = superstore_data.columns.get_loc( 'SEGMENT').

This is the get_loc method on the columns of a dataframe. So, we have a start_column,
which we define as a superstore_data.columns.get_loc. And then the name of the
column, that's ORDER_ID. And then the end_column, which is a similar operation for the
SEGMENT column. start_column and end_column can be thought of as positions.

These are locations that is index locations within the list of all columns. Once we have
the start and the end column, we can index our dataframe using iloc. And when we have
iloc, we can specify a row as well as a column filter.

We are going to specify the start and end columns of course as the column filters.

He enters the following command in code cell 36: superstore_data .iloc [:5,
start_column:end_column].

We do have a row filter and that row filter is :5. So we are asking for all of the rows with
index position 0 through 4.

Let's go ahead and run this iloc operation. As you might expect, we are going to get five rows. And we are going to get all of the columns starting with ORDER_ID and ending with, but not including, SEGMENT. So there are two little points to be learned from this example.

The first is that we can use the get_loc property on the columns of a dataframe in order
to get the specific index location of a specific column. That's the first little lesson. The
second lesson is that when we specify row and column indices, the second index is not
included in the return range.

This is true for both row and columns. You can see for instance, that the row filter is :5.
And so we are going to get rows with the index positions 0 through 4, 5 is not included.


Likewise for the columns, we have a start_column to end_column. The start_column is ORDER_ID and that is included in the result dataframe. The end_column is SEGMENT and that's not included in the resulting dataframe.

In this example on screen now, we did not know the index locations for these two
columns, but we looked them up using get_loc. We can do something similar with rows.
There, it's a little more direct, we can just look up into the index property using index
positions.

He enters a set of commands in code cell 37. The first command line is: start_row=
superstore_data.index [10]. The second command line is: end_row=
superstore_data.index[15].

So here we have a start_row, which is the label at position 10 inside the index of superstore_data, and end_row, which is the label at position 15. Remember that the index of a Pandas dataframe gives a list of all of the row labels. So we started with the index positions and from that we got the actual labels at those positions. start_row and end_row are now labels and not positions.

And that's why we've got to make use of loc rather than iloc. On screen now we've
made use of loc, we've specified a row as well as column filters. The row filter has the
start_row and the end_row. Remember both of those are labels and not index positions.

And the column filter has the initial column name, which is PRODUCT_ID and the last
column name which is SALES. Let's go ahead and hit Shift+Enter and execute this.

He enters the following command in code cell 38: superstore_data.loc[start_row:end_row, 'PRODUCT_ID':'SALES'].

We get the resulting Pandas dataframe. And we can see immediately that we have all of
the columns starting from PRODUCT_ID and ending with SALES. This gives us a hint
about another difference between loc and iloc.

We saw that when we use iloc, the end of the range that we specify is not included in
the result. That's in keeping with the idea that index positions start from 0 and go up to
n-1.

When we use loc on the other hand, both the start and the end column or ROW_IDs are
going to be included. We can see that on screen now, we have every column up to and
including the SALES column. So far, most of our operations have focused on extracting
one dataframe from another.


The subset of rows and columns that we have obtained has been in the form of a
dataframe itself. Along the way, we did learn how to extract entire columns using single
square brackets. But we've paid little attention so far to extracting individual values, so
let's do exactly that.

He enters the following command in code cell 39: superstore_data.at[39607, 'COUNTRY'].

On screen now we've made use of the at operator in order to extract a specific value
from a particular dataframe. This is a value which is at a particular row label, 39607, and
a particular column, which is COUNTRY. When we execute this command, we see that
what we get back is neither a dataframe nor a series. It's just a string, that string is the
United States. Now we could have accomplished something similar using the loc
operator.

Here's an example on screen now. We use the loc operator to look up the row where the label is equal to 1 and the column is COUNTRY.

He enters the following command in code cell 40: superstore_data .loc[1,'COUNTRY'].

This gives us a string as well. This corresponds to the string Mexico. We can see from
cells 39 and 40 that when we want to extract a single value from a dataframe, we have a
choice. We can use either loc or at. But at is more efficient.

And in order to prove this, let's make use of a handy little Python utility called timeit.
Timeit is a Python module which provides a simple way to time small bits of Python
code.  

He enters the following command in code cell 41: timeit superstore_data.loc[31454, 'COUNTRY'].

It does so in a correct way, and it avoids many possible mistakes which can be
performed or introduced while timing bits of code. This is what the timeit docs tell us.

On screen now we've used timeit in order to compare the performance of loc and at
operations. We first use timeit to measure how fast loc is. We can see from this that the
mean time is 7.18 microseconds. There's also a standard deviation of 35.2 nanoseconds.
There's additional detail there about the number of runs and the number of iterations.
Let's now repeat this operation with at instead of loc.

And we find that the mean time reduces quite significantly, it's only 4.48 microseconds.


He enters the following command in code cell 42: timeit superstore_data.at[31454, 'COUNTRY'].

And the standard deviation is still quite small, 39.9 nanoseconds. We can see from these two operations, both of which look up the specific value corresponding to the row label 31454 and the column COUNTRY, that at is significantly faster than loc.

Now just like we have loc and iloc, there's also at and iat. Let's try out those as well.
Let's extract the row and column indexes corresponding to specific values.

He enters a set of commands in code cell 43. The first command line is: row_name=
superstore_data.index.get_loc (16727). The second command line is: column_name=
superstore_data.columns.get_loc ('COUNTRY').

So we get the row index corresponding to the label 16727 and the column index corresponding to the column COUNTRY. Once we have these two indices, we can compare the performance of iloc and iat.

He enters the following command in code cell 44: timeit superstore_data.iloc[row_name, column_name].

We've done this using timeit once again. And when we run the test, we can see that this
time, the difference in performance is not quite as prominent.

He enters the following command in code cell 45: timeit superstore_data.iat[row_name, column_name].

The average time for iloc is 25 microseconds and the standard deviation is 361 nanoseconds. The average time for iat is 23.1 microseconds, with a standard deviation of 297 nanoseconds.

It's still clear that iat is faster than iloc, but the delta is not quite as marked as it was between loc and at. In any case, that gets us to the end of our exploration of loc, iloc, at, and iat.

We've explored several different ways of accessing rows, columns, and individual cells
within a Pandas dataframe.
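
As a quick recap, here is a hedged sketch contrasting the general-purpose and the scalar-optimised accessors on the frame used above; the timeit lines are Jupyter magics and are shown as comments:

# loc/iloc work for rows, columns, and slices; at/iat are optimised for one cell at a time.
value_by_label = superstore_data.at[31454, 'COUNTRY']
row_position = superstore_data.index.get_loc(31454)
column_position = superstore_data.columns.get_loc('COUNTRY')
value_by_position = superstore_data.iat[row_position, column_position]
assert value_by_label == value_by_position

# In a notebook, the timeit magic compares the two access paths:
# %timeit superstore_data.loc[31454, 'COUNTRY']
# %timeit superstore_data.at[31454, 'COUNTRY']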


5. Filtering Data Using Wildcards and Boolean Predicates


Topic title: Filtering Data Using Wildcards and Boolean Predicates. Your host for this
session is Vitthal Srinivasan.

In this demo, we'll turn our attention away from loc, iloc, at, and iat towards more complex filtering operations. Let's start simple; we begin by invoking the sample method on our DataFrame, and we specify an input argument of 10.

A Jupyter notebook is open on the screen. He enters the following command in code
cell 46: superstore_data .sample(10).

This is going to pick ten rows at random.

So we have 10 rows and 23 columns, and we get a good sense for the names of these
columns. Now let's go ahead and start filtering. We are now going to make use of the
filter function for the first time. As its name would suggest, the filter function can be
used to restrict the rows or columns that we get back from a DataFrame.

He enters the following command in code cell 47: superstore_data.filter(items=['PRODUCT_NAME', 'TOTAL_SALES']).

Here, we are specifying column names. Those column names are PRODUCT_NAME and
TOTAL_SALES. They have been packed up into a list and assigned to the named input
argument, items. The filter method actually has some additional named arguments,
including axis, regex, and like. We'll get to those in a moment. For now, let's just run this
command. Here, pandas is smart enough to figure out which axis we are filtering on.

It's smart enough to figure out that these names refer to column names. And so when
we run this, we are only going to get the columns which are in the list we specified. But
wait a minute, when we run this, we only have ROW_IDs and the PRODUCT_NAME
column. What happened to TOTAL_SALES? Well, the problem is that we had replaced
TOTAL_SALES with just the name SALES.

That's why the result has just one column, but it does have all of the rows. We have
51,290 rows, and we know that that's all of the rows in our DataFrame. Let's try a
different variant of the filter operation. Here, we are going to specify the like input
argument.

He enters the following command in code cell 48: superstore_data.filter(like='NAME').


This is a string, and it can be used for text-based matches on either row labels or column names.

Here, by default, this like operation is going to apply to the column names. So we are in effect requesting pandas to give us all rows for all columns where the column name contains the string NAME. There are two such columns, CUSTOMER_NAME and PRODUCT_NAME. You can see down below that we once again have all 51,290 rows for these two columns.

Here, the input argument we specified to like was applied to the column names. Let's
see how we can change that and filter on the row labels instead. On screen now, we've
applied the filter operation with the same like input argument.

He enters the following command in code cell 49: superstore_data .filter (like='3862',
axis = 0).

The difference is that we've now also specified the value of the axis. axis = 0
corresponds to the rows, that is to the index labels. So when we run this command,
we're going to get all rows where the ROW_ID is like 3862. If you look closely, this
includes ROW_ID 3862, 38620, 38621, and so on and so forth. Every ROW_ID which
contains 3862 has been selected by this filter operation.

We can scroll down and see that we have 15 rows and 23 columns. So clearly, our like
filter applied only to the ROW_IDs. It did not exclude any of the 23 columns from the
original DataFrame. Let's try yet another variant of the filter operation. This one takes in
a named input argument called regex.

He enters the following command in code cell 50: superstore_data.filter(regex='-').head().

By default, the regex is going to apply to axis = 1. That is, it's going to apply to the column names. Here, the regex or regular expression that we've specified simply consists of a single character, the hyphen.

So when we apply this filter, we're going to get all columns which contain the hyphen,
that's the dash character. We've then invoked the head property, so the result has just
five rows. And those five rows correspond to just one column, SUB-CATEGORY. And
that's because SUB-CATEGORY is the only column in this DataFrame which contains a
hyphen. You can see the hyphen between the words SUB and CATEGORY. Let's apply
another regular expression.


This time, we are going to make use of the special character \A. \A followed by a pattern
is going to return all values which start with that pattern.

He enters the following command in code cell 51: superstore_data.filter(regex='\AORDER').head().

Here, the pattern is the string ORDER. Once again, by default, this is going to apply to
axis = 1, that is to the column names.

And so this is going to filter out all column names which begin with the word ORDER.
Let's go ahead and run this, and we see that we do indeed have five rows. That's
because this was a head operation. What's more interesting is that all of the columns
which show up in the result do indeed begin with the word ORDER. So we have
ORDER_ID, ORDER_DATE, and ORDER_PRIORITY. We've clearly been able to successfully
filter using this regular expression.
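
Pulling the three variants together, here is a minimal sketch using the column and label values from this demo:

# items: an exact list of column names (names that do not exist are silently dropped).
by_items = superstore_data.filter(items=['PRODUCT_NAME', 'SALES'])

# like: a substring match, applied here to the row labels because axis=0.
by_like = superstore_data.filter(like='3862', axis=0)

# regex: a regular-expression match against the column names, which is the default axis.
by_regex = superstore_data.filter(regex=r'\AORDER')

print(by_items.shape, by_like.shape, by_regex.shape)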

We've now explored several different variants of the filter function. We've seen how to filter rows as well as columns, and how we can apply filters based on a list of items, a like expression, or a regular expression. Let's now turn our attention to a different kind of filtering and querying operation, this time based on a Boolean predicate. On screen now, we have a Boolean predicate called category_filter. category_filter is equal to the result of a Boolean expression.

He enters a set of commands in code cell 52. The first command line is: category_filter
= superstore_data[ 'CATEGORY'] == 'Office Supplies'. The second command line is:
category_filter.

That Boolean expression tests whether the CATEGORY column of our DataFrame is equal to a specific value. That specific value is Office Supplies. So if we focus only on the right-hand side of the assignment operation, that is a Boolean predicate.

That Boolean predicate is going to return True for every row where CATEGORY is equal
to Office Supplies. And it's going to return False for every row where the CATEGORY is
not equal to Office Supplies. As we can tell from the formatting of the output, this is a
series. This is a series of values where the row labels are the ROW_IDs. And the
corresponding values are either True or False, based on whether that ROW_ID had
CATEGORY equal to Office Supplies or not. Once we've constructed this category_filter,
we can use it to filter our data in the original DataFrame.

And that's exactly what we've done on screen now.


He enters the following command in code cell 53: superstore_data[category_filter].

We are restricting the DataFrame superstore_data based on the value of category_filter.


For every ROW_ID where category_filter is equal to True, we will get one row in the
output. Every False value in category_filter is going to lead to the corresponding row
being omitted from the output.

The result of all of this is that the output is only going to consist of rows where the
CATEGORY is equal to Office Supplies. We can see that this is indeed the case by
scrolling over to the right. If you focus on the CATEGORY column, you can see that every
row in the output has CATEGORY equal to Office Supplies. And so in this way, we've
been able to filter based on a binary predicate, or a binary condition.

He enters the following command in code cell 54: multiple_categories_filter = superstore_data['SUB-CATEGORY'].isin(['Paper', 'Binders', 'Labels']).

Let's try another condition.

This time we are going to try and filter out only those rows where the SUB-CATEGORY is
in a list. And that list has the specific values Paper, Binders, and Labels. So here we've
introduced the isin operator, the isin operator can take in a list as an input argument.
And then we've applied that isin operator to a specific column. That column is the SUB-
CATEGORY column of our superstore_data.

Once again, this is the result of a binary predicate. The rows in the output are either
going to be True or False. Every row where the subcategory is in Paper, Binders, or
Labels is going to have a ROW_ID and the value True.

Every other row will have the ROW_ID and the value False.

He enters the following command in code cell 55: superstore_data[multiple_categories_filter].

If we now filter using this multiple_categories_filter, we will only be left with those rows
where the value in the filter was True. And this means that we will only have rows
where the SUB-CATEGORY is in Paper, Binders, or Labels.

We can satisfy ourselves that this is the case by looking closely at the values in the SUB-
CATEGORY column. It's also easy enough to reverse the results of a binary predicate. We
simply make use of the tilde character.


He enters the following command in code cell 56: superstore_data[~category_filter].

On screen now, we are filtering the superstore_data DataFrame. Once again, we use the category_filter, but we employ the tilde character to get the inverse of the binary predicate. The result of this is going to be to return all rows where the CATEGORY is not equal to Office Supplies. And we can see that if we scroll over towards the right, every value in the CATEGORY column is something other than Office Supplies.

The tilde character is a great way of reversing a condition filter. Let's move on and
construct some logical and relational conditions.

He enters the following command in code cell 57: superstore_data[superstore_data['SALES'] > 10000.0].

Now on screen, we are only extracting those rows from superstore_data where the
SALES are greater than $10,000.

The syntax is pretty intuitive. And when we run this code, we get back a DataFrame
which has just five rows, because there are only five rows where the sales exceed
10000. And we have all columns, because we have not applied any filter on the columns.

We can scroll over to the right and verify that all of the SALES are greater than 10000.
Notice how we made use of the single square brackets while applying this filter. Let's go
ahead and construct a similar filter on the SHIPPING_COST column.

But this time, we will use the object notation in order to get a handle to the
SHIPPING_COST column.

He enters the following command in code cell 58: superstore_data[superstore_data.SHIPPING_COST > 900.0].

We can see inside the square brackets that our condition makes use of the syntax
superstore_data.SHIPPING_COST. And it then has the greater than operator and $900.
And we can see from the result that every one of the rows which appears has
SHIPPING_COST of greater than 900.

Let's round off this demo with an even more complex condition.

He enters a set of commands in code cell 59. The first command line is: year_2013 = superstore_data[superstore_data['ORDER_DATE'].str.endswith('2013')]. The second command line is: year_2013.head().


We are now trying to extract all rows which correspond to the year 2013. The way we
are doing this is using the endswith function. This endswith function is applied to the
string value contained within the ORDER_DATE column.

And the endswith function takes in an input argument. That input argument is the string
2013. In effect, we are filtering out only those rows from our data where the
ORDER_DATE ends with 2013.

And when we run this, we can see that this does work. All of the rows in the result do
indeed have an ORDER_DATE of 2013.

We've successfully demonstrated the use of various regular expressions, filters, and
other types of predicates in filtering data in pandas DataFrames.
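
As a closing sketch for this demo, the predicates above can be combined with &, |, and ~, and parsing the dates is a sturdier alternative to matching '2013' as text. This assumes pandas is already imported as pd and that ORDER_DATE parses cleanly with pd.to_datetime:

# Each mask is a boolean Series aligned on the ROW_ID index.
office = superstore_data['CATEGORY'] == 'Office Supplies'
paper_like = superstore_data['SUB-CATEGORY'].isin(['Paper', 'Binders', 'Labels'])
big_sales = superstore_data['SALES'] > 1000.0

combined = superstore_data[office & paper_like & big_sales]   # all three conditions hold
not_office = superstore_data[~office]                         # negation with the tilde

# Year filtering via parsed dates rather than string matching (assumes the dates parse).
order_dates = pd.to_datetime(superstore_data['ORDER_DATE'])
year_2013 = superstore_data[order_dates.dt.year == 2013]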


6. Using the Query Function to Filter Data


Topic title: Using the Query Function to Filter Data. Your host for this session is Vitthal
Srinivasan.

In this demo, we'll continue building on what we learned in the previous demo and continue our exploration of selecting, filtering, and querying operations on pandas dataframes. Let's get started by writing a simple query which returns all rows where the customer is in the United States. The syntax for this is on screen now.

He enters a set of commands in code cell 60. The first command line is: customer_us =
superstore_data[superstore_data[ 'COUNTRY'] == 'United States']. The second
command line is: customer_us .head( ).

You can see that we apply a filter condition to the data frame superstore_data. That
filter condition in and of itself first looks up the country column, and then checks for
equality with a specific string, which is the United States. Notice how the double equal
to operator is used for logical equality comparisons and the single equal to operator is
used for assignment.

As a result of this code, the variable customer_us is going to hold all of the rows from the original data frame where the country was equal to the United States.
Let's confirm this by hitting Shift+Enter. And because this is a head operation, we only
have five rows in the result. We do have all 23 columns. And if you look at the COUNTRY
column way over towards the right, you can see that all of these rows have a country
equal to the United States. This was a nice, gentle warm up for the kind of querying and
filtering operations we're building towards. Our next little example is very similar, but it
makes use of the query method which is available on pandas data frames.

He enters a set of commands in code cell 61. The first command line is: customer_nyc =
superstore_data.query( "CITY == 'New York City'"). The second command line is:
customer_nyc .head( ).

The query method is very similar to the syntax we've just used, but it's somewhat more pythonic. It takes in a predicate string; you can see here that we've not used any square brackets at all. As the pandas docs tell us, this query function is used to filter the rows of a dataframe using a Boolean expression written over its columns. Here the Boolean expression checks whether the city is equal to the string New York City.

Notice how we have quoted the string New York City: it's enclosed within single quotes, and the whole Boolean expression is enclosed within double quotes. Let's go ahead and


execute the simple query method. And we see that we do indeed have all of the rows
where the CITY column contains the string New York City. The predicate we passed into
the query method here was quite simple. Let's try something a little more complex.

He enters a set of commands in code cell 62. The first command line is:
us_and_mexico_customers = superstore_data.query('COUNTRY in ["United States",
"Mexico"] & SALES > 1000.0'). The second command line is:
us_and_mexico_customers.sample(5 ).

Now, on screen is yet another invocation of the query method, but this time we have a
compound condition. This compound condition is actually quite complex. We check
whether the country is in the list United States, Mexico and whether the sales are
greater than $1,000. This predicate is interesting in two different ways.

Note first off the use of the logical AND operator. Also note that this logical AND operator consists of a single ampersand (&). It does not consist of a double ampersand, and it is not the word 'and' spelled out. The other bit that's worth noting is the pythonic
nature of the condition. We are checking whether country is in a list and this syntax
looks a lot more like Python. It doesn't involve any of the square brackets or other
syntax which is specific to pandas. Let's go ahead and execute this code using
Shift+Enter. And we see that the resulting data frame does indeed only contain rows
where the COUNTRY is either the United States or Mexico. Then let's scroll over to the
right to examine the SALES, and we can indeed see that every one of the rows here has
sales greater than $1,000.

And so in this way, we've successfully executed a compound predicate using the query
method of a pandas dataframe. The output that you now see on screen is that from a
sample command. And you can see that when we invoke the sample method, we
specified the number of rows that we wanted sampled as equal to 5. What if we'd like to tell at a glance how many rows in total satisfy this condition? Well, the answer is simple enough: simply check the shape property of the result dataframe.

He enters the following command in code cell 63: us_and_mexico_customers.shape.

So let's do that and hit Shift+Enter. And we get a tuple which has 598, 23. This tells us
that there are 598 rows and 23 columns in the result. The logical AND (&) and OR (|)
operators can be used with conventional pandas syntax just as easily as they can be
used with the query method. On screen now is an example in which we are filtering
from our dataframe superstore_data.


He enters a set of commands in code cell 64. The first command line is:
superstore_data[ (superstore_data[ 'SHIP_MODE'] == 'First Class') &. The second
command line is: (superstore_data[ 'SALES'] > 10000)].

And this filter consists of the logical AND of two conditions. The first of those two conditions checks whether the SHIP_MODE is equal to First Class. The second of those two conditions checks whether the SALES are greater than 10,000. Once again, note how these two predicates are linked using the logical AND operator, which is just a single ampersand. The result of ANDing these two predicates is then used inside a single
pair of square brackets, and the result of all of this is a data frame. That data frame
happens to have just two rows, because there are only two rows in our data set, which
satisfy both these conditions. Where the SHIP_MODE is equal to First Class, and the
sales value is greater than 10,000. We've got to scroll over to the right in order to
examine the SALES column, but both of those values are indeed greater than 10,000.
Now this syntax which you can see in cell 64, is longer than it needs to be.

There really was no reason for us to have the two separate conditions. We could just
condense this into the syntax which you now see on screen.

He enters the following command in code cell 65: superstore_data.query("SHIP_MODE == 'First Class' & SALES > 10000").

Using the query method, all we needed to do was pass in a simple predicate.
SHIP_MODE == 'First Class' & SALES > 10000. We can execute this query and see that we get the same results, just two rows, but it's a lot more succinct. This tells us why the query method is preferable in many situations to building complex pandas predicates which involve the use of square brackets. The original inspiration for
pandas and for data frames came from R. However, R's syntax often leads to the more
complex conditions which we've seen so far. The query syntax which you see on screen
now is much more like ordinary Python, it's easier to read and write. Let's go through
another example which demonstrates the same point.

On screen now is a complex condition written in the old school R style.

He enters a set of commands in code cell 66. The first command line is:
superstore_data[ (superstore_data[ 'ORDER_PRIORITY'] == 'Critical') |. The second
command line is: (superstore_data[ 'SHIP_MODE'] =='Same Day')].tail().

So we first check whether ORDER_PRIORITY is equal to Critical. We then perform a logical OR with another condition, namely that SHIP_MODE is equal to Same Day. Note how each one of these conditions needs to be applied within square brackets.


We've got to logically OR them using the single pipe (|) operator. That then goes into
another pair of square brackets, which is then used to apply to superstore_data. And
finally we invoke the tail method. We run this and we see that we indeed have 5 rows
and 23 columns. Because this is a logical OR condition, there are going to be some rows
where one or the other of the conditions will not be met.

So for instance, the third row on screen now which has ROW_ID 16469, has SHIP_MODE
equal to First Class. That SHIP_MODE is not equal to Same Day. But if we scroll over to
the right, we can see that the other condition is satisfied, the ORDER_PRIORITY is indeed
Critical. And that's why it's been included in the result. Next, let's rewrite the same
condition in Python-like syntax rather than R-like syntax.

He enters the following command in code cell 67: superstore_data.query("ORDER_PRIORITY == 'Critical' | SHIP_MODE == 'Same Day'").tail().

This involves the use of the query method, and a much shorter and simpler string
predicate. That string predicate checks whether ORDER_PRIORITY == 'Critical'; note how
critical is enclosed within single quotes. Then we have the logical OR operator which is a
single pipe sign. And then comes the second condition, SHIP_MODE == 'Same Day'. And
once again same day is enclosed within single quotes. This entire string predicate is
passed into the query method, the result of applying this is a dataframe on which we
can invoke the tail method. And now when we hit Shift+Enter, we get the same result
that we had a moment ago. For instance, you can see that the third row in the result is
the same row ID, 16469. That's the row we had spot checked a moment ago where the
SHIP_MODE was First Class. But where we had checked by scrolling over to the right,
that ORDER_PRIORITY is equal to Critical.


7. Manipulating Data Using Datetime Values


Topic title: Manipulating Data Using Datetime Values. Your host for this session is
Vitthal Srinivasan.

In this demo, we will continue with the querying and the filtering operations, which we
were working on in the previous demo, but we'll now apply these operations to data
containing dates.

He enters a set of commands in code cell 68 . The first command line is:
superstore_data[ 'ORDER_DATE'] =
pd.to_datetime(superstore_data[ 'ORDER_DATE' ]). The second command line is:
superstore_data[ 'SHIP_DATE'] = pd.to_datetime(superstore_data[ 'SHIP_DATE' ]). The
third command line is: superstore_data.info().

On screen now, we are changing the type of two columns in our data frame. These are
the ORDER_DATE and the SHIP_DATE. The way in which we're doing this is as follows.
We make use of the pd.to_datetime method, we apply that method to the
ORDER_DATE column, as well as to the SHIP_DATE column. This has the effect of
changing the dtype of these columns from object (that is, string) to datetime. All of this is
done on the right-hand side of the equal to sign. On the left-hand side, we are assigning
these newly converted columns back into the corresponding columns of our
superstore_data.

So the effect of running this snippet of code is to change the type of the ORDER_DATE
and the SHIP_DATE column to be of datetime. This is a change that is being made in
place, we are modifying the original data frame. Finally, we invoke the info method on
this modified data frame. And when we run this, we can confirm that ORDER_DATE and
SHIP_DATE now have a dtype of datetime64[ns]. That's the standard datetime format
used in Python as well as in pandas. We've now seen how to convert string columns into
date columns. Having done this, we can work on them by sorting in datetime order.

On screen now, we're using the sort_values method.

He enters a set of commands in code cell 69 . The first command line is:
superstore_data.sort_values('ORDER_DATE', ascending=True, inplace=True). The
second command line is: superstore_data.head().

The first input argument is the column name, that's ORDER_DATE. The second is the
type of sort, so we have ascending set to True. And finally, we have the inplace name
argument, which we've also set to True. This is one of those inplace row-level ordering


operations which we've learned to be careful of. When we now invoke the .head
method on this sorted data frame. We can see that the ROW_IDs now are in a different
order. Specifically, they are in the order of the ORDER_DATE. So the first order date that
we have in our data set is the first of January 2011. Now that we've correctly
reinterpreted the ORDER_DATE and SHIP_DATE columns as datetimes, we can perform
filtering operations as well.
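
Before we do, one quick aside: once a column has the datetime64 dtype, the .dt accessor exposes components such as the year and month. A minimal sketch, assuming ORDER_DATE has already been converted as above:

# count how many orders fall in each calendar year
superstore_data['ORDER_DATE'].dt.year.value_counts()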

He enters a set of commands in code cell 70. The first command line is: start_date =
'01-01-2011'. The second command line is: end_date = '31-01-2011'.

On screen now, we've defined two strings, start_date and end_date. Both of these
strings are in an appropriate and acceptable date format. The first of these corresponds
to the first of January 2011, that's the start date, and the end date is the 31st of January
2011. Next, we'll make use of the start_date and end_date variables in order to get only the orders for January 2011.

He enters the following command in code cell 71: order_jan2011 = (superstore_data['ORDER_DATE'] >= start_date) & (superstore_data['ORDER_DATE'] <= end_date).

We do this by checking whether the ORDER_DATE column in our superstore_data data frame is greater than or equal to the start date and, using the logical AND operator (again just a single ampersand), whether the ORDER_DATE is less than or equal to the end_date. The result of this operation is stored in a variable called order_jan2011. This is only going to consist of True and False values.

In other words, this is merely a filter. Once we have this filter, however, we can go ahead and apply it to our data frame, because the index labels are going to be preserved in the filter. This is going to correctly select all of the rows in the superstore_data data frame where the date was in January. And this is done using the loc property: you can see that we've specified loc with square brackets and then passed in order_jan2011.

He enters a set of commands in code cell 72. The first command line is: jan_data =
superstore_data.loc [order_jan2011]. The second command line is: jan_data.head().

Let's execute this code and examine the head of this data frame. And we can see that
the ORDER_DATE is indeed always in January 2011. The rows in the head all have the
date of 1st January 2011. Let's repeat this operation with the tail just to be doubly sure that
we haven't introduced any bugs.

He enters the following command in code cell 73: jan_data.tail().


And when we do this, we can see that the tail rows all have ORDER_DATE of 31st
January 2011. So we have indeed selected rows between a start_date and an end_date.
And we've done so by using datetime processing operations. This output which you see
on screen for the jan_data data frame was obtained using the R-style syntax. Let's now
repeat a similar calculation using the query method and the Python style syntax. On
screen now, we have another data frame called data_2014.

He enters a set of commands in code cell 74. The first command line is: data_2014 =
superstore_data.query("ORDER_DATE >= '01-01-2014' & ORDER_DATE <= '31-12-
2014'"). The second command line is: data_2014.tail().

This is obtained by invoking the query method on our data frame.

That query method takes in a predicate, ORDER_DATE >= '01-01-2014' & ORDER_DATE
<= '31-12-2014'. Notice again the use of the single ampersand for the logical AND, and how the date boundaries have been specified as strings delimited by single quotes. Once
again, we see just how much easier and simpler the query method and the Python style
syntax is. We execute this and run the tail method.

And from the results, we can indeed see that all of the ORDER_DATES in the result are in
2014. We've successfully implemented various datetime operations using both the R-
style syntax and the Python style syntax with the query method.
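
As an aside, an arguably more robust way to grab a whole year is to compare on the .dt.year component rather than on date strings. A minimal sketch, assuming ORDER_DATE still has the datetime64 dtype:

data_2014_alt = superstore_data[superstore_data['ORDER_DATE'].dt.year == 2014]
data_2014_alt.tail()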


8. Selecting and Dropping Columns


Topic title: Selecting and Dropping Columns. Your host for this session is Vitthal
Srinivasan.

In this demo, we're going to turn our attention away from querying, filtering and
searching data frames. And instead start to focus on dropping rows and columns from
pandas data frames. Because this is a brand new workbook, let's start by re-importing
pandas with the usual alias of pd.

He enters the following command in the first code cell: import pandas as pd.

Once that's done, let's go ahead and customize a couple of pandas options.

Let's specify that we want display.max.columns to be set to None. This means that
whenever we display a pandas data frame, all columns are going to be displayed. We'll
just need to scroll way over to the right in order to examine them.

He enters a set of commands in code cell 2. The first command line is: pd.set_option('display.max.columns', None). The second command line is: pd.set_option('display.precision', 2).

This ensures that no columns will be replaced by the ellipsis. Second option, we specify
is display.precision and we give this the value 2. This has the effect of printing all floating
point values with exactly two places after the decimal point. It just makes it a lot easier
to examine floating point values. We've encountered the pandas set_option method
before.

Remember that this sets class level properties so it's going to influence all pandas data
frames. Once these options have been set, let's read in data using pd.read_csv. The file
that we are reading in is in the Datasets folder and its name is loan.csv.

He enters a set of commands in code cell 3. The first command line is: loan_data =
pd.read_csv( 'Datasets/loan.csv'). The second command line is: loan_data.head ( ).

So we are changing datasets for this demo. As usual, we invoke the head
method on this loan_data data frame. We can see the first five rows, the labels are 0
through 4.

So those are the default labels and we can also see all of the columns. We can scroll way
over to the right and we see that there are no ellipses. So every column has indeed been


displayed. Next, let's use the shape property in order to see how many rows and
columns the loan data has.

He enters the following command in code cell 4: loan_data.shape.

We can see from this that as usual, we have a tuple with two numbers. The first gives
the number of rows, the second gives the number of columns. There are 30 columns
and a really large number of rows, 887379. That's close to a million rows.
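
Because shape is an ordinary Python tuple, we can unpack it whenever we want the two counts as separate variables. A minimal sketch:

n_rows, n_cols = loan_data.shape
print(n_rows, n_cols)  # 887379 30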

Let's examine the column names by invoking the .columns property. As usual, this
returns an Index object.

He enters the following command in code cell 5: loan_data.columns.

This encapsulates a list which has the names of all 30 columns. This entire demo is going
to be focused on dropping columns from this data frame in the right way. Let's get
started. We first drop a single column, this is the column called home_ownership.

He enters a set of commands in code cell 6. The first command line is: loan_data_trim
= loan_data.drop( 'home_ownership' , axis=1). The second command line is:
loan_data_trim.head ( ).

We are only specifying one input argument in addition to the column name, and that's the axis: axis equal to 1 corresponds to column operations, and axis equal to 0 corresponds to row operations. By specifying axis equal to 1, we are letting pandas know that it's a column
name that we have specified as the first input argument. As we shall see, it's also
possible to specify inplace equal to True. When you don't specify the inplace name
parameter, inplace is default False. And this means that this column is going to be
dropped, but not from the original data frame. Instead, a copy is going to be created
and returned. We cache that copy in a variable called loan_data_trim.

Let's go ahead and invoke the head method on this, we can see that we no longer have
the column we just removed. We scroll left to right, but the home_ownership column
doesn't appear. We can also confirm that there's one less column than we had
previously by checking the shape property. He enters the following command in code
cell 7: loan_data_trim.shape. The number of rows remains the same, but the number
of columns has reduced, it's now at 29 instead of 30. Again, please note that we are
examining loan_data_trim, the original data frame is unchanged and that's called just
loan_data.
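
As an aside, recent versions of pandas also accept a columns keyword argument to drop, which avoids having to remember that axis=1 means columns. A minimal sketch, assuming the same loan_data DataFrame:

# equivalent to loan_data.drop('home_ownership', axis=1)
loan_data_trim = loan_data.drop(columns='home_ownership')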


We can also invoke the columns property on loan_data_trim.

He enters the following command in code cell 8: loan_data_trim.columns.

This has the Index object, this Index object no longer contains the column
home_ownership. Just to be absolutely sure, let's also invoke the columns property on
the original data frame, that's loan_data.

He enters the following command in code cell 9: loan_data.columns.

And here when we run this, we find that the home_ownership column does indeed
appear. You can see it over at the extreme right of the first line, it's prominently missing
from the corresponding output in loan_data_trim.

Let's now repeat the same drop operation but with one crucial difference. This time we
are going to specify inplace to be equal to True. Now on screen you can see we've
invoked the drop method on loan_data.

He enters a set of commands in code cell 10. The first command line is:
loan_data.drop( 'home_ownership' , axis=1, inplace=True). The second command line
is: loan_data.head( ).

First input argument is the column, that's still home_ownership. The second input
argument is axis which is equal to 1 because this is a column operation. And the third
argument is inplace which is equal to True. Because this is an inplace operation, we do
not need another variable to hold the return value.

We simply invoke head on loan_data, and we can scroll right and satisfy ourselves that
the home_ownership column has disappeared. Let's also check the shape, this has also
changed, the number of columns has dropped from 30 to 29.

He enters the following command in code cell 11: loan_data.shape.

And finally, we can reinvoke the columns property, this is going to give us a list of all of
the columns in loan_data.

He enters the following command in code cell 12: loan_data.columns.

And we can see that home_ownership, which used to appear at the extreme right of
the first line, no longer shows up there. We now know how to drop columns inplace as
well as not inplace, let's move on and perform some more drop operations.


He enters a set of commands in code cell 13. The first command line is: loan_data_trim = loan_data_trim.drop(['income_category', 'term', 'id'], axis=1). The second command line is: loan_data_trim.head( ).

This time we are going to drop multiple columns at one go. We specify the columns we
wish to drop in a list, the columns are income_category, term, and id. All of these are
enclosed within square brackets. We also specify the axis which is equal to 1 because
this is a column operation. Note that we omit the inplace name parameter and that's
why by default, it's going to be treated as False. This means that a copy is going to be
made.

Now however, we'll do something a little tricky with that copy, we assign that copy back
into the same variable, that's loan_data_trim. The effect of doing it in this way is the
same as performing an inplace drop. This point is an important one: if you perform a drop which is not inplace, but then go ahead and store the return value in the same variable, you've achieved the same effect as an inplace drop. Let's confirm that this is
the case, we run the head command on loan_data_trim.

We can see that we no longer have income_category, term or id. We can also confirm
this using the shape property, we've lost three more columns so the number of columns
has dropped from 29 to 26.

He enters the following command in code cell 14: loan_data_trim.shape.

We can also check this using the columns property, this has shrunk as well.

He enters the following command in code cell 15: loan_data_trim.columns.

So this was yet another way, a roundabout way of performing an inplace drop. Let's
explore some other ways of dropping columns, let's create a list. This is called
columns_list and it has two elements, application_type and purpose.

He enters the following command in code cell 16: columns_list = ['application_type', 'purpose'].

Once we create this variable, we can pass it in to the drop method. So we've once again
invoked the drop method on loan_data_trim.

He enters a set of commands in code cell 17. The first command line is: loan_data_trim
= loan_data_trim.drop(columns_list, axis=1). The second command line is:
loan_data_trim.head ( ).


We pass in columns_list as the first input argument, and we pass in axis equal to 1 as the
second input argument. This means it's a column-wise operation. This time again, we've
omitted inplace so by default it's going to be False. But then we've gone ahead and
stored the return value in the variable of the same name. So once again, we have the
same effect. We've performed an inplace drop even though we've not specified inplace
equal to True.

Let's go ahead and run this command, we can check visually that we don't see
application_type or purpose. We can scroll left and right to confirm this, then let's run
the shape property. Two more columns have been dropped so we are down to 24 from
26.

He enters the following command in code cell 18: loan_data_trim.shape.

Finally, we can check the columns property and that has shrunk as well.

He enters the following command in code cell 19: loan_data_trim.columns.

Please note that all of these changes were on loan_data_trim. Let's now turn our
attention back to the original data frame, which was just called loan_data.

He enters the following command in code cell 20: loan_data.columns.

Let's examine the columns. You can see that we still have all of those original columns,
barring the one column we dropped, which was home_ownership. If we check the
shape of this, we can see that we have 29 columns in loan_data.

He enters the following command in code cell 21: loan_data.shape.

And we can Index into those columns using the syntax on screen now.

He enters the following command in code cell 22: loan_data.columns[12].

loan_data.columns followed by a pair of square brackets containing an index position. That index here is 12. This corresponds to the column with the name application_type. We can use this indexing operator in order to drop this column.

Here on screen we've invoked the drop method on loan_data.

He enters a set of commands in code cell 23. The first command line is: loan_data_trim
= loan_data.drop( loan_data.columns[12], axis=1). The second command line is:
loan_data_trim.head ( ).


The first input argument is loan_data.columns with the Index of 12, the second input
argument is axis equal to 1. We've saved the return value in loan_data_trim. Now
remember that just before we ran this line of code, we had already created a trimmed
down version of loan_data. So loan_data_trim used to have 24 columns, however, after
we run this code, loan_data_trim is going to have 28 columns. Let's confirm that this is
the case, we check the shape property of loan_data_trim.

And indeed, this is true, we are back at 28 columns.

He enters the following command in code cell 24: loan_data_trim.shape.

Let's now go back to the original loan_data data frame, and let's index into the columns with indices 0, 14, and 17.

He enters the following command in code cell 26: loan_data.columns[0], loan_data.columns[14], loan_data.columns[17].

We've printed these out separated by commas and so this returns a tuple. That tuple
has three elements: the column names id, purpose, and interest_payment_cat. Let's package these three index positions up into a list; let's call it columns_list, and it has the indices 0, 14, and 17.

He enters the following command in code cell 27: columns_list = [0, 14, 17].

And then let's go ahead and pass this in to drop.

He enters a set of commands in code cell 28. The first command line is: loan_data_trim
= loan_data_trim.drop( loan_data.columns[columns_list], axis=1). The second
command line is: loan_data_trim.head ( ).

As usual, we specify axis is equal to 1, and save the return value in loan_data_trim.

We can examine the return values, as well as the shape of loan_data_trim. We are back
down to 25 columns, we had 28 columns, we've eliminated 3 and so we now have 25.

He enters the following command in code cell 29: loan_data_trim.shape.

We've now got a good handle on some relatively simple ways of dropping columns from
a data frame. In the demo coming up ahead, we'll move to even more complex ways
which make use of regular expressions and other pattern matches.
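
Before we do, one more basic pattern is worth noting: the mirror image of dropping columns is simply selecting the ones we want to keep. A minimal sketch, assuming these three column names (which appeared in earlier outputs) are still present in loan_data:

keep_cols = ['id', 'purpose', 'grade']
# index with a list of names to keep only those columns
loan_data[keep_cols].head()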


9. Applying Advanced Techniques to Select and Drop Columns


Topic title: Applying Advanced Techniques to Select and Drop Columns. Your host for
this session is Vitthal Srinivasan.

In this demo, we are going to continue working with the loan data DataFrame. At this
point, this DataFrame has 29 columns and 887379 rows.

He enters the following command in code cell 31: loan_data.shape.

This demo is going to continue focusing on dropping rows and columns.

He enters the following command in code cell 32: loan_data.columns.str.contains('^total').

Let's begin by applying a regular expression. We apply the contains method to the string
representation of the columns. And we pass in a regular expression with the caret sign followed by the word total. The caret symbol matches the start of a string. So this is going to return True for every column name which begins with the word total. There are two columns for which this condition is True; for the remaining 27 it is False. Let's now go ahead and use the loc property to filter out all of those columns which start with the
word total.

He enters a set of commands in code cell 33. The first command line is: loan_data_trim = loan_data.loc[:, ~loan_data.columns.str.contains('^total')]. The second command line is: loan_data_trim.head( ).

For this, we're making use of the same expression that we have above in cell 32.

But we precede it with a tilde symbol, which has the effect of negating or reversing the True and False values. We've saved the result of this operation in loan_data_trim. So loan_data_trim now has 27 columns: the original 29 minus the two which started with the word total. Let's now perform a similar operation, but this time we will only exclude
columns which end with a certain pattern.

He enters the following command in code cell 34: loan_data_trim.columns.

We first examine all of the columns in loan_data_trim; there are 27 of them, one of
them is called grade. Another is called grade_cat.

Next we're going to learn how to drop only the column which ends with the word grade.


He enters a set of commands in code cell 35. The first command line is: loan_data_trim = loan_data_trim.loc[:, ~loan_data_trim.columns.str.contains('^grade$')]. The second command line is: loan_data_trim.head( ).

The syntax for this, which is on screen now is very similar to the previous drop
operation. The only major difference is in the form of the regular expression. Here the
regular expression is grade followed by the dollar sign. That dollar sign is going to match
the end of a column name. The overall structure of this code is very similar. Notice the use of the tilde symbol in order to reverse the match on the regular expression, and the use of the loc property.

Let's go ahead and run this. And you can see that we've now lost the column titled
grade. We do still have the column titled grade_cat. That's because of the dollar sign
which only matches the word grade if it appears at the very end of a column name. Let's also
examine the shape of loan_data_trim.

He enters the following command in code cell 36: loan_data_trim.shape.

We can see that at this point, we still have all of the rows. So there is no decrease in the
number of rows, but the number of columns has declined by one, and it's now at 26.
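
A related shortcut is the filter method, which keeps rather than drops the columns whose names match a pattern. A minimal sketch on the same loan_data DataFrame:

# keep only the columns whose names contain the word 'grade'
loan_data.filter(regex='grade').head()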

We've now learned how to eliminate all columns which match a particular regular
expression from a DataFrame. Let's move on to something a little more complicated.
Let's learn how to drop an entire range of columns from a DataFrame. For this, let's
begin by examining the columns in loan_data_trim.

He enters the following command in code cell 37: loan_data_trim.columns.

Let's say that we wish to drop all columns from interest_payments up to loan_condition. The way to do this is on screen now. We invoke the drop method. What we pass into the drop method is the columns property of a slice taken with loan_data_trim.loc.

He enters a set of commands in code cell 38. The first command line is: loan_data_trim = loan_data_trim.drop(\. The second command line is: loan_data_trim.loc[:, 'interest_payments' : 'loan_condition'].columns, axis=1). The third command line is: loan_data_trim.head( ).

And in the loc indexer, we specify all of the columns starting with interest_payments and going on to loan_condition. Note the use of the columns property, as well as the axis equal to one. You find that when we run this code, we've now eliminated all of the columns from interest_payments through loan_condition. Those columns would have


appeared after purpose_cat and before loan_condition_cat, but they don't show up in
the output. So we've successfully managed to drop a range of columns. If we now
examine the shape of this DataFrame, we find that we've got only 23 columns left.

He enters the following command in code cell 39: loan_data_trim.shape.

We can also double check that these columns have been eliminated by examining the
.columns property.

He enters the following command in code cell 40: loan_data_trim.columns.

And when we do this, we find that indeed, all of those have been eliminated from our
DataFrame. Now, let's turn our attention to dropping columns using iloc, rather than loc.

He enters the following command in code cell 41: loan_data_trim.iloc[:, 15: 17]
.sample( ).

On screen now, we've performed an iloc on loan_data_trim, and we passed in the column index positions 15 through 17. We can see that the return value includes just two columns, purpose_cat and loan_condition_cat. Remember that with iloc, unlike with loc, the end value of the range is not included in the filter.

He enters a set of commands in code cell 42. The first command line is: loan_data_trim
= loan_data_trim.drop(loan_data_trim.iloc[:, 15: 17], axis=1). The second command
line is: loan_data_trim.head ( ).

Let's now go ahead and drop these two columns. You can see how this is done. We have loan_data_trim.drop, and then we pass in the expression which we just had in cell 41, loan_data_trim.iloc with column positions 15 through 17. Axis is equal to one, and we've omitted the inplace name parameter. But we do save the result back into loan_data_trim, and so that has the effect of dropping these two columns as well. Let's scroll left to right, and we can confirm that purpose_cat and loan_condition_cat no longer show up in the data frame.

He enters the following command in code cell 43: loan_data_trim.shape.

We can also verify the shape; the number of columns has dropped to 21 from 23.

He enters the following command in code cell 44: loan_data_trim.columns.

And finally, let's also just run the columns property, those two columns don't show up
there either.


We've now explored a really wide range of ways of dropping columns. Let's turn our
attention to dropping rows.

He enters the following command in code cell 45: loan_data.shape.

For this let's go back to the original loan data DataFrame. When we left it, it had 29
columns, as well as all of the rows from the original data. Let's now see how we can
start dropping rows.

He enters a set of commands in code cell 46. The first command line is: loan_data_trim
= loan_data.drop([2, 4], axis= 0). The second command line is: loan_data_trim.head
( ).

The simplest way of doing this is simply by specifying some row labels; all we've got to change is the axis value. We now have axis set to be equal to zero. We've passed in a list of two row labels, 2 and 4. When we run this, we can see from the row labels over on
the extreme left, that we no longer have the row labels 2 and 4.

So we have 0, 1, and then 3, 5 and 6. Let's examine the shape: we can see that the number of columns is unchanged. It's still 29, but the number of rows has declined by two.

He enters the following command in code cell 47: loan_data_trim.shape.

Next, let's see how to drop all rows which do not satisfy a specific condition.

He enters a set of commands in code cell 49. The first command line is: loan_data_trim
= loan_data_trim[loan_data_trim.purpose != 'other']. The second command line is:
loan_data_trim.head ( ).

On screen now, we have a condition. That condition, enclosed within square brackets, is loan_data_trim.purpose not equal to 'other'. We then pass the result of this condition into the indexing operator and apply that to loan_data_trim. The return value from this is also saved in loan_data_trim.

We run this command, but it's a little hard to see whether it was successful or not. So
let's instead check for all of the unique values in the purpose column. We do this by
invoking unique on loan_data_trim.purpose. The return value clearly does not contain
the string other.

He enters the following command in code cell 50: loan_data_trim.purpose.unique( ).


So in this way, we have successfully eliminated all rows where the purpose was equal to 'other'. We can also confirm that the number of rows has reduced by examining the shape property. It's substantially lower than it was some moments ago.

He enters the following command in code cell 51: loan_data_trim.shape.

The number of columns hasn't changed; it's still at 29. Now, when we drop rows from a DataFrame, that causes the index labels to get out of sync. If we'd like to reset those index labels, what we've got to do is make use of the reset_index method.

He enters a set of commands in code cell 52. The first command line is: loan_data_trim
= loan_data_trim.reset_index(drop=True ). The second command line is:
loan_data_trim.head ( ).

On screen now, you can see that we've invoked the reset_index method.

And we've passed in the value of drop to be equal to True. We can view the output and
we see that the labels are back to being contiguous starting from zero. You might recall
that we had eliminated or dropped the rows with index labels 2 and 4. But those labels now show up once again, attached to different rows, because we've renumbered all of the index labels. Now on screen we
perform an operation where we drop exactly one row from our DataFrame.

He enters a set of commands in code cell 53. The first command line is: loan_data_trim
= loan_data_trim.drop(loan_data_trim.index[4] ). The second command line is:
loan_data_trim.head ( 10).

What row is that? Well, it's going to be the one with the label at index position 4; because we've just reset all of the indexes, that's the same as label 4. So when we run this command, you can see that we have a row with label 3, and we have a row with label 5.

But we do not have a row with label 4 because that's the row that we've just dropped.
But dropping rows in this manner can get tricky.

He enters a set of commands in code cell 54. The first command line is: loan_data_trim
= loan_data_trim.drop(loan_data_trim.index[[6, 8]] ). The second command line is:
loan_data_trim.head ( 10).

On screen now we performed a very similar operation, we are dropping the rows which
are at index positions 6 and 8. Can you guess, without viewing the output, which rows are going to be dropped? Well, if you guessed that it's going to be the rows with labels 6


and 8, that answer is wrong. As you can see, the rows we are missing have labels 7 and
9.

And that's because we've already dropped one row and so the index labels were out of
sync. At position 6, we had row label 7, and at position 8, we had row label 9, and that's
why the output is missing row labels 7 and 9. This is yet another example of how
tricky it gets the moment index values or index labels are no longer contiguous starting
from 0. Let's round out this conversation about dropping rows with a look at negative
row indexes.

He enters a set of commands in code cell 56. The first command line is: loan_data_trim
= loan_data_trim.drop(loan_data_trim.index[-3] ). The second command line is:
loan_data_trim.tail( ).

Negative indexes in Python work from the end of the list. So here we are effectively
dropping the third last row. To see what this looks like, let's view the output of the tail
command. You can see that the ROW_IDs end with 844483. You can also see that we
are missing 844481, because that was the third last row. That's how we managed to
drop the third row from the end using the index of negative three. Negative indexes are
pretty handy anytime you would like to work with indexes towards the end of the list.

He enters a set of commands in code cell 57. The first command line is: loan_data_trim
= loan_data_trim[:-5]. The second command line is: loan_data_trim.tail( ).

Here for instance, we are selecting all rows starting from the first row up to but not
including the last five rows. And that's done using the syntax on screen. When we
examine the output of this using the tail command, we can see that the last ROW_ID
that we've included is 844477. So that has eliminated the rows which appear after this.
This gets us to the end of our exploration of dropping rows and columns. As you can see, this is quite complex. It's especially complicated to drop rows, and that's because the row labels start to get out of sync. If you're not careful, out-of-sync row labels can lead to pretty nasty bugs and gotchas.
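
One way to sidestep the out-of-sync label gotcha described above is to reset the index immediately after every row drop. Here is a minimal, self-contained sketch using a small throwaway DataFrame rather than the loan data:

import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})
# drop the row at position 1, then immediately renumber the labels 0, 1, 2, ...
df = df.drop(df.index[1]).reset_index(drop=True)
print(df)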


10. Course Summary


Topic title: Course Summary

We've now come to the end of this course, Filtering Data in Pandas. We started this
course by exploring how complex data filter operations can be performed using the loc
and iloc functions. We also learned how the unique values and the number of rows with
those unique values from a column can be accessed using the value_counts method. We
then saw how columns can be filtered based on their data type using the select_dtypes function, and how ranges of rows can be accessed using either the loc method, which takes in labels, or the iloc method, which takes in index positions rather than labels.

We then moved to filtering data. We did so in two ways, using first the pandas filter and
query methods, which follow a pythonic idiom. And then, the classic pandas lookup
syntax, which derives heavily from R syntax. We saw how the filter function accepts
wildcards as well as regular expressions, and then used various methods such as the .isin
method to filter data. Along the way, we learned handy little tricks such as the use of
the tilde symbol to reverse a condition.

Finally, we learned how to correctly drop rows and columns from a pandas dataframe.
We saw how columns could be dropped based on their index positions in a dataframe,
and used a regular expression with the loc function to find all columns that satisfy a condition, and then eliminated all of those columns from our object. We then moved on to
learn how rows can be dropped in Pandas. For instance, we saw how the drop method
can be used to drop rows based on their index position. We became aware of a possible
gotcha that arises when we drop rows in place but then neglect to reset the index labels
in our dataframe object. You now have a solid grasp of filtering data in pandas using the
loc and iloc methods. And are aware of best practices in dropping rows and columns, as
well as of some possible pitfalls that you might encounter during such operations.

You are now well-positioned to move on to cleaning and analyzing data in pandas, that's
in the course coming up next.


11. Test
