0% found this document useful (0 votes)
86 views22 pages

20 Pandas Functions For 80% of Your Data Science

The document discusses 20 commonly used Pandas functions that can help with 80% of data science tasks. It provides examples of functions like pd.read_csv() for reading CSV files, df.describe() for generating summary statistics, df.info() for getting a DataFrame summary, and df.plot() for creating various plot types from DataFrames. The document also covers functions for selecting rows and columns like df.iloc() and df.loc(), adding new columns with df.assign(), filtering with df.query(), and sorting with df.sort_values(). Mastering these functions can help improve data science workflow efficiency.

Uploaded by

HARRY
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
86 views22 pages

20 Pandas Functions For 80% of Your Data Science

The document discusses 20 commonly used Pandas functions that can help with 80% of data science tasks. It provides examples of functions like pd.read_csv() for reading CSV files, df.describe() for generating summary statistics, df.info() for getting a DataFrame summary, and df.plot() for creating various plot types from DataFrames. The document also covers functions for selecting rows and columns like df.iloc() and df.loc(), adding new columns with df.assign(), filtering with df.query(), and sorting with df.sort_values(). Mastering these functions can help improve data science workflow efficiency.

Uploaded by

HARRY
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

20 Pandas Functions for 80% of your Data Science

Tasks
levelup.gitconnected.com/20-pandas-functions-for-80-of-your-data-science-tasks-b610c8bfe63c

30 January 2023

Master these Functions and Get Your Work Done

Pandas is one of the most widely used libraries in the data science community and it’s a
powerful tool that can help you with data manipulation, cleaning, and analysis. Whether
you’re a beginner or an experienced data scientist, this article will provide valuable
insights into the most commonly used Pandas functions and how to use them practically.

We will cover everything from basic data manipulation to advanced data analysis
techniques, and by the end of this article, you will have a solid understanding of how to
use Pandas to make your data science workflow more efficient.

1/22
20 Pandas Functions for 80 % of Data Science Tasks by Youssef Hosni

If you want to start a career in data science & AI and do not know how. I offer
data science mentoring sessions and long-term career mentoring:

Long-term mentoring:
Mentoring sessions:

Join Medium with my referral link - Youssef Hosni

Read every story from Youssef Hosni (and thousands of other writers on
Medium). Your membership fee directly supports…

youssefraafat57.medium.com

If you want to study Data Science and Machine Learning for free, check out
these resources:

2/22
Free interactive roadmaps to learn Data Science and Machine Learning by yourself.
Start here:
The search engine for Data Science learning resources (FREE). Bookmark your
favorite resources, mark articles as complete and add study notes.
Want to learn Data Science from scratch with the support of a mentor and a learning
community? Join this Study Circle for free:

1. pd.read_csv()
pd.read_csv is a function in the panda's library in Python that is used to read a CSV
(Comma Separated Values) file and convert it into a pandas DataFrame.

Example:

pandas pddf = pd.read_csv()

In this example, the pd.read_csv function reads the file ‘data.csv’ and converts it into a
DataFrame, which is then assigned to the variable ‘data’. The contents of the DataFrame
can then be printed using the print function.

It has many options like sep , header , index_col , skiprows , na_values etc.

= pd.read_csv(, sep=, header=, index_col=, skiprows=, na_values=)

This example reads the CSV file data.csv , with ; as a separator, the first row as the
header, the first column as an index, skipping the first 5 rows and replacing N/A with
NaN.

2. df.describe()
The df.describe() method in pandas is used to generate summary statistics of various
features of a DataFrame. It returns a new DataFrame that contains the count, mean,
standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum of

3/22
each numerical column in the original DataFrame.

(df.describe())

You can also include or exclude certain columns and also include non-numerical columns
by passing appropriate arguments to the method.

df.describe(include=)

df.describe(exclude=)

4/22
3. df.info()
df.info() is a method in pandas that is used to get a concise summary of the
DataFrame, including the number of non-null values in each column, the data types of
each column, and the memory usage of the DataFrame.

(df.info())

4. df.plot()
df.plot() is a method in pandas that is used to create various types of plots from a
DataFrame. By default, it creates a line plot of all numerical columns in the DataFrame.
But you can also pass the argument kind to specify the type of plot you want to create.
Available options are line, bar, barh, hist, box, kde, density, area, pie, scatter, and
hexbin.

In the examples below I will use the .plot() method to plot numerical and categorical
variables. For the categorical variable, I will plot bar and pie plots and for the numerical
variable, I will plot box plots. You can try a different types of plots by yourself.

df[].value_counts().plot(kind=)

5/22
df[].value_counts().plot(kind=)

[].plot(kind=)

It also supports many other options like title , xlabel , ylabel , legend , grid ,
xlim , ylim , xticks , yticks etc. to customize the plot. You can also use
plt.xlabel() , plt.ylabel() , plt.title() etc. after the plot to customize it.

Please keep in mind that df.plot() is just a convenient wrapper around the
matplotlib.pyplot library, so the same customization options available in
matplotlib are available for df.plot() .

6/22
5. df.iloc()
Pandas’ .iloc() function is used to select rows and columns by their integer-based
index in a DataFrame. It is used to select rows and columns by their integer-based
location.

Here are some examples of how you can use it:

(df.iloc[])

(df.iloc[:])

(df.iloc[:, ])

(df.iloc[:, :])

7/22
(df.iloc[, ])

In the above examples, df.iloc[0] selects the first row of the dataframe,
df.iloc[:2] selects the first two rows, df.iloc[:, 0] selects the first column,
df.iloc[:, :2] selects the first two columns, and df.iloc[1, 1] selects the
element at the (1, 1) position of the dataframe (second row, second column).

Keep in mind that .iloc() only selects rows and columns based on their integer-based
index, so if you want to select rows and columns based on their labels you should use
.loc() method instead as will be shown next.

6. df.loc()
Pandas’ .loc() function is used to select rows and columns by their label-based index
in a DataFrame. It is used to select rows and columns by their label-based location.

Here are some examples of how you can use it:

(df.loc[:,])

8/22
(df.loc[:, [, ]])

In the above example, we used df.loc[:, 'Gender'] to select the column named
'Gender', and df.loc[:, ['Year of Birth', 'Gender']] selects the columns named
'Year of Birth' and 'Gender'.

7. df.assign()
Pandas’ .assign() function is used to add new columns to a DataFrame, based on the
computation of existing columns. It allows you to add new columns to a DataFrame
without modifying the original dataframe. The function returns a new DataFrame with
the added columns.

Here is an example of how you can use it:

df_new = df.assign(count_plus_5=df[] + )df_new.head()

9/22
In the above example, the first time df.assign() is used to create a new column named
'count_plus_5' with the value of count + 5.

It’s important to note that the original DataFrame df remains unchanged and the new
DataFrame df_new is returned with the new columns added.

The .assign() method can be used multiple times in a chain, allowing you to add
multiple new columns to a DataFrame in one line of code.

8. df.query()
Pandas’ .query() function allows you to filter a DataFrame based on a Boolean
expression. It allows you to select rows from a DataFrame using a query string similar to
SQL. The function returns a new DataFrame containing only the rows that satisfy the
Boolean expression.

Here is an example of how you can use it:

df_query = df.query()df_query.head()

df_query df.df_query.head

10/22
In the above example, the first time df.query() is used to select rows where the count
is greater than 30 and the rank is less than 30, and the second time df.query() is used
to select rows where gender is 'MALE'.

It’s important to note that the original DataFrame df remains unchanged and the new
DataFrame df_query is returned with the filtered rows.

The .query() method can be used with any valid Boolean expression and it's useful
when you want to filter a DataFrame based on multiple conditions or when the conditions
are complex and hard to express using the standard indexing operators.

Also, keep in mind that the .query() method is slower than boolean indexing, so if
performance is critical, you should use boolean indexing instead.

9. df.sort_values()
Pandas’ .sort_values() function allows you to sort a DataFrame by one or multiple
columns. It sorts the DataFrame based on the values of one or more columns, in
ascending or descending order. The function returns a new DataFrame sorted by the
specified column(s).

Here is an example of how you can use it:

df_sorted = df.sort_values(by=)df_sorted.head()

df_sorted = df.sort_values(by=, ascending=)df_sorted.head()

df_sorted = df.sort_values(by=[, ])df_sorted.head()

11/22
In the above example, the first time df.sort_values() is used to sort the DataFrame
by 'Count' in ascending order, the second time is used to sort by 'Rank' in descending
order, and the last time it's used to sort by multiple columns 'Count' and 'Rank'.

It’s important to note that the original DataFrame df remains unchanged and the new
DataFrame df_sorted is returned with the sorted values.

The .sort_values() method can be used with any column(s) of the DataFrame and it's
useful when you want to sort the DataFrame based on multiple columns, or when you
want to sort the DataFrame by a column in descending order.

10. df.sample()
Pandas’ .sample() function allows you to randomly select rows from a DataFrame. It
returns a new DataFrame containing the randomly selected rows. The function takes
several parameters that allow you to control the sampling process, such as the number of
rows to return, and whether or not to sample with replacement and seed for
reproducibility.

Here is an example of how you can use it:

df_sample = df.sample(n=, replace=, random_state=)df_sample

df_sample = df.sample(n=, replace=, random_state=)df_sample

df_sample = df.sample(n=, replace=, random_state=, axis=)df_sample

12/22
In the above example, the first time df.sample() is used to randomly select 2 rows
without replacement, the second time is used to randomly select 3 rows with replacement
and the last time is used to randomly select 2 columns without replacement.

It’s important to note that the original DataFrame df remains unchanged and the new
DataFrame df_sample is returned with the randomly selected rows.

The .sample() method can be useful when you want to randomly select a subset of the
data for testing or validation, or when you want to randomly select a sample of rows for
further analysis. The random_state parameter is useful for reproducibility and the
axis=1 parameter allows you to select columns.

11. df.isnull()
The isnull() method in Pandas returns a DataFrame of the same shape as the original
DataFrame, but with True or False values indicating whether each value in the
original DataFrame is missing or not. Missing values, such as NaN or None , will be
True in the resulting DataFrame, while non-missing values will be False .

df.isnull()

13/22
12. df.fillna()
The fillna() method in Pandas is used to fill in missing values in a DataFrame with a
specified value or method. By default, it replaces missing values with NaN , but you can
specify a different value to use instead as shown below:

value : Specifies the value to use to fill in the missing values. Can be a scalar value
or a dict of values for different columns.
method : Specifies the method to use for filling in missing values. Can be 'ffill'
(forward-fill) or 'bfill' (backward-fill) or 'interpolate'(interpolate values) or 'pad' or
'backfill'
axis : Specifies the axis along which to fill in missing values. It can be 0 (rows) or 1
(columns).
inplace : Whether to fill in the missing values in place (modifying the original
DataFrame) or to return a new DataFrame with the missing values filled in.
limit : Specifies the maximum number of consecutive missing values to fill.
downcast : Specifies a dictionary of values to use to downcast the data types of
columns.

14/22
# fill missing values with 0
df.fillna(0)

# forward-fill missing values (propagates last valid observation forward to next)


df.fillna(method='ffill')

# backward-fill missing values (propagates next valid observation backward to


last)

df.fillna(method='bfill')

df.interpolate()

It is important to note that the fillna() method returns a new DataFrame with the
missing values filled in and does not modify the original DataFrame in place. If you want
to modify the original DataFrame, you can use the inplace parameter and set it to
True .

df.fillna(, inplace=)

13. df.dropna()
df.dropna() is a method used in the Pandas library to remove missing or null values
from a DataFrame. It removes rows or columns from the DataFrame where at least one
element is missing.

You can remove all rows containing at least one missing value by calling df.dropna().

df = df.dropna()

If you want to remove only the columns that contain at least one missing value you can
use df.dropna(axis=1)

df = df.dropna(axis=)

You can also set thresh parameter to keep only the rows/columns that have at least
thresh non-NA/null values.

df = df.dropna(thresh=)

14. df.drop()
df.drop() is a method used in the Pandas library to remove rows or columns from a
DataFrame by specifying the corresponding labels. It can be used to drop one or multiple
rows or columns based on their labels.

You can remove a specific row by calling df.drop() and passing the index label of the
row you want to remove, and the axis parameter set to 0 (default is 0)

df_drop = df.drop()

This would remove the first row of the DataFrame.

15/22
You can also drop multiple rows by passing a list of index labels:

df_drop = df.drop([,])

This would remove the first and second rows of the DataFrame.

Similarly, you can drop columns by passing the labels of the columns you want to remove
and setting the axis parameter to 1:

df_drop = df.drop([, ], axis=)

15. pd.pivot_table()
pd.pivot_table() is a method in the Pandas library that is used to create a pivot table
from a DataFrame. A pivot table is a table that summarizes and aggregates data in a more
meaningful and organized way, by creating a new table with one or more columns as the
index, one or more columns as values, and one or more columns as attributes.

In the example below we will create a pivot table with Ethnicity as the index and
aggregate the sum of the count. This is used to know the count of each Ethnicity in the
dataset.

pivot_table = pd.pivot_table(df, =, =, aggfunc=)pivot_table.head()

16/22
You can also include more columns in the pivot table by specifying multiple index and
values parameters and also include multiple aggfunc functions.

pivot_table = pd.pivot_table(df, index=[,], values= , aggfunc=


[,])pivot_table.head()

16. df.groupby()
df.groupby() is a method in the Pandas library that is used to group rows of a
DataFrame based on one or multiple columns. This allows you to perform aggregate
operations on the groups, such as calculating the mean, sum, or count of the values in
each group.

df.groupby() returns a GroupBy object, which you can then use to perform various
operations on the groups, such as calculating the sum, mean, or count of the values in
each group.

17/22
Let's see an example using the birth names dataset:

grouped = df.groupby()(grouped.mean())

The output of the above code would be:

grouped = df.groupby(['Gender', 'Ethnicity'])

(grouped.())

The output of the above code would be:

17. df.transpose()
df.transpose() is a method in the Pandas library used to transpose the rows and
columns of a DataFrame? This means that the rows become columns and the columns
become rows.

# Transpose the DataFrame


df_transposed = df.transpose()

df_transposed.head()

18/22
It can also be done using T attributes on the dataframe. df.T will do the same as
df.transpose() .

df_transposed = df.Tdf_transposed.head()

18. df.merge()
df.merge() is a pandas function that allows you to combine two DataFrames based on
one or more common columns. It is similar to SQL JOINs. The function returns a new
DataFrame that contains only the rows where the values in the specified columns match
between the two DataFrames.

Here is an example of how to use df.merge() to combine two DataFrames based on a


common column:

# Create the first DataFrame


df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],


'value': [1, 2, 3, 4]})

# Create the second DataFrame


df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],


'value': [5, 6, 7, 8]})

# Merge the two DataFrames on the 'key' column


merged_df = df1.merge(df2, on='key')

(merged_df)

19/22
As you can see the two dataframe are merged on
the key column and the column name is
appended with _x and _y for the left and right
dataframe respectively.

You can also use left, right and outer join by passing how = ‘left’, how = ‘right’, or how =
‘outer’ respectively.

You can also merge multiple columns by passing a list of columns to the on parameter.

merged_df = df1.merge(df2, on=[,])

You can also specify a different column name to merge by using left_on and right_on
parameters.

merged_df = df1.merge(df2, left_on=, right_on=)

It’s worth noting that the merge() function has many options and parameters that allow
you to control the behavior of the merge, such as how to handle missing values, whether
to keep all rows or only those that match and what columns to merge on.

19. df.rename()
df.rename() is a pandas function that allows you to change the name of one or more
columns or rows in a DataFrame. You can use the columns parameter to change the
column names, and the index parameter to change the row names.

df_rename = df.rename(columns={: })df_rename.head()

You can also use a dictionary to rename multiple columns at once:

df_rename = df.rename(columns={: , :})df_rename.head()

20/22
You can also rename the index similarly:

df_rename = df.rename(index={:,:,:})df_rename.head()

20. df.to_csv()
df.to_csv() is a method used in the Pandas library to export a DataFrame to a CSV
file. CSV stands for "Comma Separated Values" and it is a popular file format for storing
data in a tabular form.

For example, let’s say we want to save df that you want to export to a CSV file. You can
export the DataFrame to a CSV file by calling df.to_csv() and passing the file name as
a string:

df.to_csv()

This will save the DataFrame to a file named data.csv in the current working directory.
You can also specify the path of the file by passing it to the method:

df.to_csv()

You can also specify the separator used in the CSV file by passing the sep parameter. By
default, it’s set to “,”.

df.to_csv(, sep=)

It is also possible to only save specific columns of the DataFrame by passing the list of
column names to the columns parameter, and also to save only specific rows by passing a
boolean mask to the index parameter.

df.to_csv(, columns=[,])

21/22
You can also use the index parameter to specify whether to include or exclude the index
of the dataframe in the exported CSV file.

df.to_csv(, index=)

This will exclude the index of the dataframe in the exported CSV file.

You can also use na_rep parameter to replace missing values in the exported CSV file
with a specific value.

df.to_csv(, na_rep=)

In summary here are the 20 pandas functions that can be used to accomplish most of your
data tasks:

1. pd.read_csv()
2. df.describe()
3. df.info()
4. df.plot()
5. df.iloc()
6. df.loc()
7. df.assign()
8. df.query()
9. df.sort_values()
10. df.sample()
11. df.isnull()
12. df.fillna()
13. df.dropna()
14. df.drop()
15. pd.pivot_table()
16. df.groupby()
17. df.transpose()
18. df.merge()
19. df.rename()
20. df.to_csv()

Read every story from Youssef Hosni (and thousands of other writers on
Medium). Your membership fee directly supports…

22/22

You might also like