Lab 2 More Python
School of Information Technology Management
Ted Rogers School of Management
Toronto Metropolitan University
Winter 2024
Data Frames groupby() method
Once a groupby object is created, we can calculate various statistics for each group:
In [ ]: #Calculate the mean of each numeric column for each group
df_rank.mean(numeric_only=True)
Modified based on: Python for Data Analysis, Research Computing Services, Boston University
Note: If single brackets are used to specify the column (e.g. salary), the output is a Pandas
Series object. When double brackets are used, the output is a DataFrame.
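The grouping steps above can be sketched as follows. The sample data is hypothetical (the lab's dataset is not shown here), and defining `df_rank` as `df.groupby("rank")` is an assumption based on the variable name.

```python
import pandas as pd

# Hypothetical sample data; column names follow those used elsewhere in this lab.
df = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AsstProf", "AssocProf"],
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "service": [20, 18, 3, 2, 8],
    "salary": [120000, 130000, 80000, 82000, 95000],
})

# Assumption: df_rank groups the rows by the values in the 'rank' column.
df_rank = df.groupby("rank")

# Mean of every numeric column per group; numeric_only=True skips text columns such as 'sex'.
group_means = df_rank.mean(numeric_only=True)

# Single brackets give a Series; double brackets give a DataFrame.
salary_series = df_rank["salary"].mean()
salary_frame = df_rank[["salary"]].mean()

print(group_means)
print(type(salary_series).__name__)
print(type(salary_frame).__name__)
```

Both results carry the group labels (the values of rank) as their index, so either form can be used for lookups by group.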
When selecting one column, it is possible to use a single set of brackets, but
the resulting object will be a Series (not a DataFrame):
In [ ]: #Select column salary:
df['salary']
When we need to select more than one column, or want the output to
be a DataFrame, we should use double brackets:
In [ ]: #Select columns rank and salary:
df[['rank','salary']]
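To see the difference between the two bracket styles, a minimal sketch on a small made-up DataFrame (the data values are hypothetical) might look like this:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "AssocProf"],
    "sex": ["Male", "Female", "Male"],
    "salary": [120000, 80000, 95000],
})

# Single brackets with one column name return a Series.
salary = df["salary"]

# Double brackets (a list of column names) return a DataFrame,
# even when the list holds only one name.
rank_salary = df[["rank", "salary"]]
salary_only = df[["salary"]]

print(type(salary).__name__)
print(type(rank_salary).__name__, rank_salary.shape)
print(type(salary_only).__name__, salary_only.shape)
```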
If we need to select a range of rows, we can specify the range using ":".
Notice that the first row has position 0, and the last value in the range is
omitted.
So for the range 0:10, the first 10 rows are returned, with positions starting
at 0 and ending at 9.
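A short sketch of this positional slicing, using a hypothetical 20-row DataFrame in place of the lab's data:

```python
import pandas as pd

# Hypothetical data: 20 rows with the default integer index 0..19.
df = pd.DataFrame({"salary": list(range(50000, 50020))})

# Rows at positions 0 through 9; position 10 is excluded.
first_ten = df[0:10]

print(len(first_ten))
print(first_ten.index.tolist())
```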
If we need to select a range of rows using their labels, we can use the
loc method:
In [ ]: #Select rows by their labels:
df_sub.loc[10:20,['rank','sex','salary']]
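The sketch below illustrates label-based selection on a subset whose labels are no longer consecutive. The data and the filter that produces `df_sub` are assumptions made for illustration; the lab's own `df_sub` is not defined in this section. Unlike positional slicing, a `loc` slice includes the end label.

```python
import pandas as pd

# Hypothetical data: 25 rows labelled 0..24.
df = pd.DataFrame({
    "rank": ["Prof" if i % 3 == 0 else "AsstProf" for i in range(25)],
    "sex": ["Female" if i % 2 == 0 else "Male" for i in range(25)],
    "salary": [50000 + 1000 * i for i in range(25)],
})

# Assumed filter: keep only the rows with even labels, so labels are 0, 2, 4, ...
df_sub = df[df.index % 2 == 0]

# Select by label: every remaining row labelled from 10 up to and including 20.
by_label = df_sub.loc[10:20, ["rank", "sex", "salary"]]

print(by_label.index.tolist())
```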
We can sort the data by the values in a column. By default the sorting is
done in ascending order and a new data frame is returned.
In [ ]: # Create a new data frame from the original sorted by the column service
df_sorted = df.sort_values(by='service')
df_sorted.head()
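A minimal sketch of the sorting behaviour, on hypothetical data, showing that the original data frame is left unchanged and that `ascending=False` reverses the order:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "service": [20, 3, 8],
    "salary": [120000, 80000, 95000],
})

# Ascending order by default; a new DataFrame is returned.
df_sorted = df.sort_values(by="service")

# Descending order.
df_desc = df.sort_values(by="service", ascending=False)

print(df_sorted["service"].tolist())
print(df_desc["service"].tolist())
print(df["service"].tolist())  # the original order is unchanged
```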
min, max
count, sum, prod
mean, median, mode, mad (mad was removed in pandas 2.0)
std, var
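As an illustration, the statistics listed above can be called directly on a column. The data below is hypothetical. Because the `mad` method is not available in pandas 2.0 and later, the sketch computes the mean absolute deviation by hand instead.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"salary": [80000, 95000, 120000, 95000]})
salary = df["salary"]

lowest = salary.min()
highest = salary.max()
n_values = salary.count()
total = salary.sum()
average = salary.mean()
middle = salary.median()
most_common = salary.mode()  # returns a Series, since there can be ties

# Mean absolute deviation computed manually (replacement for the removed mad()).
mean_abs_dev = (salary - salary.mean()).abs().mean()

print(lowest, highest, n_values, total)
print(average, middle, most_common.tolist(), mean_abs_dev)
print(salary.std(), salary.var())
```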
The agg() method is useful when multiple statistics are computed per column:
In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])
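A small runnable sketch of the same call. The `flights` values here are made up for illustration; only the column names come from the example above.

```python
import pandas as pd

# Hypothetical delay values (in minutes) standing in for the flights dataset.
flights = pd.DataFrame({
    "dep_delay": [2.0, 4.0, -3.0, 11.0],
    "arr_delay": [11.0, 20.0, -18.0, 7.0],
})

# One row per statistic, one column per selected variable.
summary = flights[["dep_delay", "arr_delay"]].agg(["min", "mean", "max"])

print(summary)
```

The result is a DataFrame indexed by the statistic names, so a single value can be read with, for example, `summary.loc["mean", "dep_delay"]`.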
df.method()   description
describe      Basic statistics (count, mean, std, min, quantiles, max)
kurt          Kurtosis
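A brief sketch of the two methods in the table, applied to hypothetical data:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"salary": [80000, 95000, 120000, 95000, 150000]})

# Summary statistics for every numeric column.
stats = df.describe()

# Kurtosis of a single column.
salary_kurt = df["salary"].kurt()

print(stats)
print(salary_kurt)
```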
In [ ]: %matplotlib inline
function      description
distplot      histogram
barplot       estimate of central tendency for a numeric variable
violinplot    similar to boxplot, also shows the probability density of the data
jointplot     scatterplot
regplot       regression plot
pairplot      pairplot
boxplot       boxplot
swarmplot     categorical scatterplot
factorplot    general categorical plot