
CSL 410 L17

Uploaded by

rpschauhan2003

Program: B.Tech (CSE) IV Semester II Year

CSL-410: Data Science using Python


Unit No. 2
View and Filter Data from data frames

Lecture No. 17

Dr. Sanjay Jain


Associate Professor, CSA/SOET
Outline
• Preview and examine data in a Pandas DataFrame
• Filter Data Frames Based on Value Condition
• Examples
• References
Student Effective Learning Outcomes(SELO)
01: Ability to understand subject related concepts clearly along with
contemporary issues.
02: Ability to use updated tools, techniques and skills for effective domain
specific practices.
03: Understanding available tools and products and ability to use them
effectively.
Preview and examine data in a Pandas DataFrame
• Once you have data in Python, you’ll want to check that it has loaded and
confirm that the expected columns and rows are present.
• Print the data: In a Jupyter notebook, simply typing the name of the data
frame produces nicely formatted output. Printing is a convenient way to
preview your loaded data: you can confirm that column names were imported
correctly, that the data formats are as expected, and whether there are
missing values anywhere.

<SELO: 1> <Reference No.: R1,R4>


Preview and examine data in a Pandas DataFrame
• Print the data: This is an excellent way to preview data. Note, however,
that by default pandas prints at most 60 rows and, in a notebook, 20 columns;
larger data frames are shown truncated.
• If you’d like to change these limits, you can edit the defaults using the
display options in Pandas.
pd.options.display.XX = value
• pd.options.display.width – the width of the display in characters – use this
if your display is wrapping rows over more than one line.
• pd.options.display.max_rows – maximum number of rows displayed.
• pd.options.display.max_columns – maximum number of columns
displayed.
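A minimal sketch of changing and then restoring the display options listed above. The frame and the chosen values (500, 50, 120) are made up for illustration only.

```python
import pandas as pd

# Hypothetical demo frame with more rows than the default print limit.
df = pd.DataFrame({"value": range(200)})

# Remember the current setting so we can compare after the reset.
default_rows = pd.options.display.max_rows

# Raise the limits (illustrative values, not recommendations).
pd.options.display.max_rows = 500
pd.options.display.max_columns = 50
pd.options.display.width = 120
rows_setting = pd.get_option("display.max_rows")

print(df)  # all 200 rows are now printed instead of a truncated view

# Put the three options back to their library defaults.
pd.reset_option("display.max_rows")
pd.reset_option("display.max_columns")
pd.reset_option("display.width")
```

Resetting afterwards keeps later cells in the same notebook from printing very long outputs by accident.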

<SELO: 1> <Reference No.: R1,R4>


DataFrame rows and columns with .shape
• The shape attribute gives the size of the data set: ‘shape’ returns a
tuple with the number of rows and the number of columns in the DataFrame.
Another descriptive property is ‘ndim’, which gives the number of
dimensions in your data (2 for a DataFrame).
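A small sketch of both properties, assuming a made-up three-row frame (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 columns.
df = pd.DataFrame({"country": ["A", "B", "C"], "year": [1952, 1957, 1962]})

n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple
print(df.shape)             # (3, 2)
print(df.ndim)              # 2 for a DataFrame
print(df["year"].ndim)      # 1 for a single column (a Series)
```

Note that shape and ndim are attributes, so they are written without parentheses.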

<SELO: 1> <Reference No.: R1,R4>


Preview DataFrames with head() and tail()
• The DataFrame.head() function in Pandas, by default, shows you the top 5
rows of data in the DataFrame. The opposite is DataFrame.tail(), which
gives you the last 5 rows.

<SELO: 1> <Reference No.: R1,R4>


Preview DataFrames with head() and tail()
• Passing a number to head() or tail() prints that many rows; for example,
head(3) shows the first three rows.
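The slide's original example is not available, so here is a minimal sketch on a made-up ten-row frame:

```python
import pandas as pd

# Hypothetical frame with the values 0..9 in a single column.
df = pd.DataFrame({"n": range(10)})

default_head = df.head()   # no argument: first 5 rows
first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```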

<SELO: 1> <Reference No.: R1,R4>


Data types (dtypes) of columns
• Many DataFrames have mixed data types, that is, some columns are
numbers, some are strings, and some are dates etc. Internally, CSV files do
not contain information on what data types are contained in each column;
all of the data is just characters.
• Pandas infers the data types when loading the data, e.g. if a column
contains only numbers, pandas will set that column’s data type to
numeric: integer or float.
• We can check the types of each column in our example with
the ‘.dtypes’ property of the dataframe.
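A short sketch of type inference on load. The CSV text and its column names are invented for illustration, and it is read from an in-memory string so that no file is needed.

```python
import io
import pandas as pd

# Hypothetical CSV content; on disk this would just be characters.
csv_text = """item_code,price,name
101,2.50,apple
102,3.00,pear
"""

df = pd.read_csv(io.StringIO(csv_text))

# One dtype per column: item_code is inferred as an integer type,
# price as a float type, and name as a text column.
print(df.dtypes)
```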

<SELO: 1> <Reference No.: R1,R4>


Change the Data types (dtypes) of columns
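• To change the datatype of a specific column, use the .astype() method. It
returns a converted copy and leaves the DataFrame itself unchanged, so
assign the result back to the column to keep the change.
For example, to see the ‘Item Code’ column as a string, use: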
data['Item Code'].astype(str)
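A minimal sketch of the difference between viewing and keeping the converted column, assuming a made-up frame with an integer ‘Item Code’ column:

```python
import pandas as pd

# Hypothetical data; only the column name comes from the slide.
data = pd.DataFrame({"Item Code": [101, 102, 103]})
original_dtype = data["Item Code"].dtype

# astype() returns a new Series; `data` itself is not modified here.
as_text = data["Item Code"].astype(str)
dtype_after_view = data["Item Code"].dtype

# Assign the result back to make the change permanent.
data["Item Code"] = data["Item Code"].astype(str)
print(data["Item Code"].tolist())  # ['101', '102', '103']
```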

<SELO: 1> <Reference No.: R1,R4>


Describing data with .describe()
• For numeric columns, describe() returns basic statistics: the value count,
mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th
percentiles for the data in a column.
• For string columns, describe() returns the value count, the number of
unique entries, the most frequently occurring value (‘top’), and the
number of times the top value occurs (‘freq’).
• To describe a single column, select it with its name as a string inside
square brackets [] and call describe() on the result, e.g.
data['Item Code'].describe()
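The slide's original example is not available. The sketch below uses a made-up frame with one numeric and one string column to show the two kinds of summary.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "fruit": ["apple", "apple", "pear", "fig"],
})

num_stats = df["price"].describe()  # count, mean, std, min, 25%, 50%, 75%, max
txt_stats = df["fruit"].describe()  # count, unique, top, freq

print(num_stats)
print(txt_stats)
```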

<SELO: 1> <Reference No.: R1,R4>


Describing data with .describe()
• Use the describe() function to get basic statistics on columns in your
Pandas DataFrame. Note the differences between columns with numeric
datatypes, and columns of strings and characters.

<SELO: 1> <Reference No.: R1,R4>


Describing data with .describe()
• If describe() is called on the entire DataFrame, statistics are returned
only for the columns with numeric datatypes by default, and the result is
itself a DataFrame.
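A short sketch of this default, again on a made-up frame. The include="all" argument is an addition to the slide's content: it asks describe() to summarise the non-numeric columns as well.

```python
import pandas as pd

# Hypothetical data: one numeric column and one string column.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "fruit": ["apple", "apple", "pear", "fig"],
})

numeric_only = df.describe()              # default: numeric columns only
everything = df.describe(include="all")   # numeric and string columns

print(numeric_only)
print(everything)
```

In the include="all" result, statistics that do not apply to a column (for example the mean of a string column) are shown as NaN.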

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• One of the biggest advantages of having the data as a Pandas Dataframe is
that Pandas allows us to slice and dice the data in multiple ways.
• Often, you may want to subset a pandas dataframe based on one or more
values of a specific column. Essentially, we would like to select rows based
on one value or multiple values present in a column.
• Let us take an example using a Pandas dataframe to filter or select rows
based on the values of one or more columns. Let us first load the gapminder
data as a dataframe into pandas.
# load pandas
import pandas as pd
data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• This data frame has 1704 rows and 6 columns. One of the columns is
year. Let us look at the first three rows of the data frame.

print(gapminder.head(3))
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe Based on a Single Value of a
Column?
• One way to filter rows in Pandas is to use a boolean expression. We first
create a boolean Series by taking the column of interest and checking whether
its value equals the specific value that we want to select/keep.
• For example, let us filter (subset) the dataframe based on the year value
2002. This condition results in a boolean Series that is True where the
value of year equals 2002 and False otherwise.
# does year equal 2002?
# is_2002 is a boolean Series with True or False for each row
is_2002 = gapminder['year']==2002
# filter rows for year 2002 using the boolean Series
gapminder_2002 = gapminder[is_2002]
print(gapminder_2002.shape)
print(gapminder_2002.head())
<SELO: 1> <Reference No.: R1,R4>
Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe Based on a Single Value of a
Column?
• Output:
(142, 6)
    country  year         pop continent  lifeExp    gdpPercap
Afghanistan  2002  25268405.0      Asia   42.129   726.734055
    Albania  2002   3508512.0    Europe   75.651  4604.211737
    Algeria  2002  31287142.0    Africa   70.994  5288.040382
     Angola  2002  10866106.0    Africa   41.003  2773.287312
  Argentina  2002  38331121.0  Americas   74.340  8797.640716

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• How To Filter rows using Pandas chaining?
• We can also use Pandas method chaining to access a dataframe’s column
and select rows as in the previous example. Chaining makes it easy to
combine one Pandas command with another Pandas command or with
user-defined functions.
• Here we call the Pandas eq() function on the year series to check
element-wise equality, and use the result to filter the data corresponding
to the year 2002.
# filter rows for year 2002 using the boolean expression
gapminder_2002 = gapminder[gapminder.year.eq(2002)]
print(gapminder_2002.shape)
(142, 6)
• In the above example, we checked for equality (year==2002) and kept the
rows matching a specific value. We can use any other comparison operator,
such as “less than” (<) or “greater than” (>), to create a boolean
expression that filters the rows of a pandas dataframe.
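A minimal sketch of filtering with other comparison operators. The small frame below is made up and only borrows column names from the gapminder data; the values are not real.

```python
import pandas as pd

# Hypothetical gapminder-like data.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 2002, 1952, 2002],
    "lifeExp": [40.0, 55.0, 68.0, 79.0],
})

after_2000 = df[df["year"] > 2000]     # rows with year greater than 2000
low_life = df[df["lifeExp"] < 50]      # rows with lifeExp less than 50
also_after = df[df.year.gt(2000)]      # gt() is the method form of >

print(after_2000)
print(low_life)
```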
<SELO: 1> <Reference No.: R1,R4>
Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe Whose Column Value Does
NOT Equal a Specific Value?
• Sometimes, you may want to keep rows of a data frame based on values of
a column that do not equal something. Let us filter our gapminder
dataframe to the rows whose year column is not equal to 2002. Basically, we
want all the years’ data except for the year 2002.
# filter rows where year does not equal 2002 (attribute access)
gapminder_not_2002 = gapminder[gapminder.year != 2002]
# equivalent, using bracket access to the column
gapminder_not_2002 = gapminder[gapminder['year']!=2002]
gapminder_not_2002.shape
(1562, 6)

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe Whose Column Value is
NOT NA/NAN?
• Often you may want to filter a Pandas dataframe so that you keep only the
rows where the value of a certain column is NOT NA/NaN.
• We can use the Pandas notnull() method to filter based on NA/NaN values of
a column.
# keep only the rows of the dataframe whose year value is not NA/NaN
gapminder_no_NA = gapminder[gapminder.year.notnull()]
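A self-contained sketch on a made-up frame with one missing year. The dropna(subset=...) call is shown only as an alternative that gives the same rows; it is not part of the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in the year column.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [1952, np.nan, 2002],
})

mask = df["year"].notnull()           # True where year is present
kept = df[mask]                       # rows A and C remain
same = df.dropna(subset=["year"])     # alternative with the same result

print(kept)
```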

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe Based on a list?
• In the previous examples, we selected rows based on a single value, e.g.
year == 2002.
• However, often we may have to select rows using multiple values present
in an iterable or a list. For example, let us say we want to select rows for
the years [1952, 2007].
• The isin() function allows us to select rows using a list or any iterable.
If we use isin() with a single column, it simply results in a boolean
Series with True where the value matches and False where it does not.
# To select rows whose column value is in a list
years = [1952, 2007]
gapminder.year.isin(years)

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe Based on a list?
• We can use the boolean array to select the rows like before
gapminder_years= gapminder[gapminder.year.isin(years)]
gapminder_years.shape
(284, 6)
• We can make sure our new data frame contains rows corresponding only to
the two years specified in the list. Let us use the Pandas unique() function
to get the unique values of the column “year”.
gapminder_years.year.unique()
array([1952, 2007])
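The same steps can be sketched without downloading the data, assuming a made-up four-row frame (the "value" column is hypothetical):

```python
import pandas as pd

# Hypothetical data with four different years.
df = pd.DataFrame({
    "year": [1952, 1957, 2002, 2007],
    "value": [1, 2, 3, 4],
})

years = [1952, 2007]
mask = df["year"].isin(years)   # boolean Series, one entry per row
subset = df[mask]               # keep only rows where mask is True

print(mask.tolist())            # [True, False, False, True]
print(subset["year"].unique())  # only 1952 and 2007 remain
```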

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe Based on Values NOT in a
list?
• We can also select rows based on values of a column that are not in a list
or any iterable. We will create a boolean Series just like before, but now
we will negate it by placing ~ in front.
• For example, to get the rows of the gapminder data frame whose continent
values are not in the continents list, we will use
continents = ['Asia','Africa', 'Americas', 'Europe']
gapminder_Ocean = gapminder[~gapminder.continent.isin(continents)]
gapminder_Ocean.shape
(24, 6)
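A minimal sketch of the negation on a made-up four-row frame, so the effect of ~ can be checked by eye:

```python
import pandas as pd

# Hypothetical rows; only the continent names follow the gapminder data.
df = pd.DataFrame({
    "country": ["Japan", "Kenya", "Australia", "New Zealand"],
    "continent": ["Asia", "Africa", "Oceania", "Oceania"],
})

continents = ["Asia", "Africa", "Americas", "Europe"]
in_list = df["continent"].isin(continents)   # True for Japan and Kenya
others = df[~in_list]                        # ~ flips True and False

print(others)
```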

<SELO: 1> <Reference No.: R1,R4>


Filter Data Frames Based on Value Condition
• How to Select Rows of Pandas Dataframe using Multiple Conditions?
• We can combine multiple conditions using the & operator to select rows
from a pandas data frame. For example, we can combine the above two
conditions to get Oceania data from the years 1952 and 2007.
gapminder[~gapminder.continent.isin(continents) &
gapminder.year.isin(years)]
• Now we will have rows corresponding to the Oceania continent for the
years 1952 and 2007.
    country  year         pop continent  lifeExp    gdpPercap
  Australia  1952   8691212.0   Oceania   69.120  10039.59564
  Australia  2007  20434176.0   Oceania   81.235  34435.36744
New Zealand  1952   1994794.0   Oceania   69.390  10556.57566
New Zealand  2007   4115771.0   Oceania   80.204  25185.00911
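A short sketch of combining conditions written with comparison operators, on made-up data. The | (or) operator is an addition to the slide's content. Each comparison is wrapped in parentheses because & and | bind more tightly than == in Python.

```python
import pandas as pd

# Hypothetical gapminder-like rows.
df = pd.DataFrame({
    "country": ["Japan", "Japan", "Australia", "Australia"],
    "year": [1952, 2007, 1952, 1957],
    "continent": ["Asia", "Asia", "Oceania", "Oceania"],
})

# Both conditions must hold (and).
both = df[(df["continent"] == "Oceania") & (df["year"] == 1952)]

# At least one condition must hold (or).
either = df[(df["continent"] == "Oceania") | (df["year"] == 2007)]

print(both)
print(either)
```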

<SELO: 1> <Reference No.: R1,R4>


Learning Outcomes

Students have learned and understood the following:

•Preview and examine data in a Pandas DataFrame


•Filter Data Frames Based on Value Condition
References

1. Data Science with Python by Aaron England, Mohamed Noordeen Alaudeen,
and Rohan Chopra. Packt Publishing; July 2019
2. https://intellipaat.com/blog/what-is-data-science/
3. https://onlinecourses.nptel.ac.in/noc20_cs36/
Thank you
