

30 Very Useful Pandas Functions for Everyday Data Analysis Tasks

rashida048 (https://regenerativetoday.com/author/rashida048/) - January 26, 2022 - Data Science (https://regenerativetoday.com/category/data-science/) - 0 Comments (https://regenerativetoday.com/30-very-useful-pandas-functions-for-everyday-data-analysis-tasks/#respond)

Python’s Pandas library is the most widely used library in Python, because it is the data manipulation library needed for every aspect of data analysis or machine learning. Even if you are working on data visualization or machine learning, some data manipulation will be there anyway. In this article, I will list the Pandas functions that are necessary for everyday use and arguably will be enough to perform the regular data manipulation tasks.

For this article, I will use a public dataset from Kaggle called the FIFA dataset. The user license is mentioned here (https://www.kaggle.com/stefanoleone992/fifa-21-complete-player-dataset/metadata).

Please feel free to download the dataset from here (https://github.com/rashida048/Datasets/blob/master/fifa.csv).

Here’s where the fun begins!

I am importing the necessary packages and setting a display option so that more columns are shown:

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

Let’s start talking about the functions:

1. pd.read_csv, pd.read_excel

The first functions to mention are read_csv and read_excel. So far, I have used at least one of them in every project. They are self-explanatory: they read a CSV or an Excel file into a pandas DataFrame. Here I am using the read_csv function to read the FIFA dataset:

df = pd.read_csv("fifa.csv")
df.head(7)
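read_excel works the same way for Excel files. A quick hypothetical example (the file name here is just illustrative):

df = pd.read_excel("fifa.xlsx")  # reads the first sheet into a DataFrame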

2. df.columns

When you have a big dataset like this, it can be hard to see all the columns. Using the .columns attribute, you can print out all the columns of the dataset:

df.columns

Output:

Index(['Unnamed: 0', 'sofifa_id', 'player_url', ...,
       'mentality_composure', 'defending_marking', ...],
      dtype='object')

3. df.drop()

You can drop some unnecessary columns using df.drop(). This dataset has so many columns that we are not going to use all of them for this tutorial, so we can easily drop a few:

df = df.drop(columns=['Unnamed: 0', 'weak_foot', 'real_face'])

I just dropped these three columns: ‘Unnamed: 0’,


‘weak_foot’, ‘real_face’.

4. len()

Provides the length of the DataFrame, that is, the number of rows. Let’s see an example:
len(df)

Output:

16155

This DataFrame has 16155 rows of data.

5. df.query() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)

You can filter or query using a boolean expression. I will


use ‘shooting’ and ‘passing’ columns for this example.
Here I am checking for which rows ‘shooting’ is bigger
than ‘passing’.

df.query("shooting > passing")

This will return the rows only where the shooting is


bigger than passing.

6. df.iloc[]

This indexer takes row and column positions and returns the corresponding subset of the DataFrame. Here I am taking the first 10 rows of data and the columns at positions 5 through 9:

df.iloc[:10, 5:10]
7. df.loc[]

This indexer does almost the same thing as .iloc[], but here we can specify exactly which row labels we want and also the names of the columns we want in our subset. Here is an example:

df.loc[[3, 10, 14, 23], ['nationality', 'weight_kg']]

Look at the row indices. We only have the 3rd, 10th, 14th,
and 23rd rows. On the other hand, for columns, we only
have the specified columns.

8. df[‘’].dtypes

Another very basic and widely used one, because it is necessary to know the data types of the variables before we dive into the analysis, visualization, or predictive modeling. I am getting the data type of the ‘height_cm’ column using .dtypes here:

df.height_cm.dtypes

Output:

dtype('int64')

You have the option to get the data type of each and
every column as well using this syntax:

df.dtypes

Output:
height_cm int64
weight_kg int64
nationality object
random_col int32
club_name object
league_name object
league_rank float64
overall int64
potential int64
value_eur int64
wage_eur int64
player_positions object
preferred_foot object
international_reputation int64
skill_moves int64
work_rate object
body_type object
team_position object
team_jersey_number float64
nation_position object
nation_jersey_number float64
pace float64
shooting float64
passing float64
dribbling float64
defending float64
physic float64
cumsum_2 int64
rank_calc float64
dtype: object
9. df.select_dtypes()

You can select the variables or columns of a certain data


type using this function. For example, I want to select the
columns with data types ‘int64’ only. Here is how to do
that:

df.select_dtypes(include='int64')

We got all the columns that have the data type ‘int64’. If
we use ‘exclude’ instead of ‘include’ in the ‘select_dtypes’
function, we will get the columns that do not have the
data type ‘int64’:

df.select_dtypes(exclude='int64')
Here is part of the output. Look, none of these columns are of type ‘int64’. You may think that the ‘random_col’ column holds integers, and it does look that way, but its data type is ‘int32’, not ‘int64’. Please feel free to check.
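For example, a quick check of that column’s data type:

df['random_col'].dtypes
# dtype('int32') -- not 'int64', so select_dtypes(include='int64') skips it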

10. df.insert() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html)

As the name of the function suggests, it inserts a column at the specified position. To demonstrate, I will first create an array of random numbers with the same length as our DataFrame:

random_col = np.random.randint(100, size=len(df))

I will insert this array as a column in the DataFrame df at position 3. Remember, the column index starts from zero.

df.insert(3, 'random_col', random_col)


Here is the part of the DataFrame again:

df.head()


Look, the column ‘random_col’ is inserted at position
three.

11. df[‘’].cumsum() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html)

It provides you with the cumulative sum. Let me explain


with an example. I am going to use the ‘value_eur’ and ‘wage_eur’ columns for this example. Here is the code:

df[['value_eur', 'wage_eur']].cumsum()

Output:
As you can see, every row shows the cumulative sum of all the values up to and including that row.

12. df.sample()

When the size of the dataset is too big, you can take a representative sample from it to perform the analysis and predictive modeling. That may save you some time. Also, too much data may sometimes ruin the visualization. We can use this function to get a certain number of data points or a certain fraction of the data. Here I am taking a random sample of 200 data points from the FIFA dataset:

df.sample(n = 200)

I am taking 25% of the FIFA dataset here:

df.sample(frac = 0.25)
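If you want the same sample every time (for example, for reproducible analysis), sample also accepts a random_state seed:

df.sample(n=200, random_state=42)  # returns the same 200 rows on every run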
13. df[‘’].where()

This function helps you query a dataset based on a boolean condition. For example, the random_col we made before has values ranging from 0 to 99. Here is how we make a Series showing which of them are bigger than 50:

df['random_col'].where(df['random_col'] > 50)

Output:

0 NaN
1 NaN
2 56.0
3 NaN
4 NaN
...
16150 65.0
16151 NaN
16152 NaN
16153 57.0
16154 NaN
Name: random_col, Length: 16155, dtype: float64

Look, where the values do not meet the condition, that is, where the value is not greater than 50, it returns NaN. We can replace the NaN with 0 or any other value using this syntax:

df['random_col'].where(df['random_col'] > 50, 0)


Output:

0 0
1 0
2 56
3 0
4 0
..
16150 65
16151 0
16152 0
16153 57
16154 0
Name: random_col, Length: 16155, dtype: int32

14. df[‘’].unique()

This is very useful when we have categorical variables. It is used to find out the unique values of a categorical column. Let’s see what the unique values of the ‘skill_moves’ column are in our FIFA dataset:

df.skill_moves.unique()

Output:

array([4, 5, 1, 3, 2], dtype=int64)

So, we have five unique values in the skill_moves column. If we print out the head of the dataset to check the values of this column, we may not see all of its unique values. So, to know all the unique values, the .unique() function comes in really handy.

15. df[‘’].nunique()

Another popular function. It tells you how many unique values you have in a column. For example, if you want to see how many different nationalities there are in this dataset, you can use this simple line of code:

df.nationality.nunique()

Output:

149

The great thing is, this function can also be used on the whole dataset to get the number of unique values in each column:

df.nunique()

Output:
height_cm 48
weight_kg 54
nationality 149
random_col 100
club_name 577
league_name 37
league_rank 4
overall 53
potential 49
value_eur 161
wage_eur 41
player_positions 907
preferred_foot 2
international_reputation 5
skill_moves 5
work_rate 9
body_type 3
team_position 29
team_jersey_number 99
nation_position 28
nation_jersey_number 26
pace 74
shooting 70
passing 67
dribbling 67
defending 69
physic 63
cumsum_2 14859
rank_calc 161
dtype: int64

Here we have the number of unique values in each


column.
16. df[‘’].rank()

This function provides you with the rank based on a


certain column. In the FIFA dataset, if we want to rank
the players based on the ‘value_eur’ column, here is the
syntax for that:

df['rank_calc'] = df["value_eur"].rank()

Using the line of code above, I created a new column named ‘rank_calc’. This new column gives the rank of each player based on ‘value_eur’. The column is added at the end by default. Please run the line of code yourself to check.
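One quick way to check is to look at a few columns side by side (a small sketch; ‘short_name’ is one of the dataset’s columns):

df[['short_name', 'value_eur', 'rank_calc']].head()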


17. .isin()

I am going to make a subset of the dataset that contains only a few nationalities of players, using the .isin() function.

nationality = ["Argentina", "Portugal", "Sweden"]
df[df.nationality.isin(nationality)]

If you run this code, you will see that the resulting dataset contains only the countries mentioned in the list above. You can see part of the dataset here:
18. df.replace()

It does exactly what it sounds like: it replaces values in a column. When we need to replace only one unique value of a column, we simply pass the old value and the new value. Imagine we just found out that the ‘league_rank’ 1.0 needs to be replaced by 1.1. Here is how to do that:

df.replace(1.0, 1.1)

Look at the league_rank column in the dataset now: 1.0 is replaced by 1.1. If we need to change more than one value, we can pass a dictionary to the replace function, where the keys are the original values and the values are the replacements.

df.replace({1.0: 1.1, 4.0: 4.1, 3.0: 3.1})
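Note that replace returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the change:

df = df.replace({1.0: 1.1, 4.0: 4.1, 3.0: 3.1})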

19. df.rename()

It is used to rename one or more columns. Here I am changing the ‘weight_kg’ and ‘height_cm’ columns to “Weight (kg)” and “Height (cm)”:

df.rename(columns={"weight_kg": "Weight (kg)", "height_cm": "Height (cm)"})

Very simple and useful!


20. .fillna()

Whenever you receive a big dataset in real life, there will be some null values in most cases. It is really hard to get a perfect dataset, so filling in the null values is part of your daily work if you are a data analyst or a data scientist. The .fillna() function replaces the null values with some other value of your choice. Here are some of the columns towards the end of the FIFA dataset:
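You can pull those columns up yourself with something like this (the column names are the ones from this dataset):

df[['pace', 'shooting', 'passing', 'dribbling', 'defending', 'physic']].head()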

Look, there are some null values in shooting, passing, defending, and some other columns. We really need to replace those null values with values of compatible data types before we start doing any predictive modeling or other data science tasks; otherwise, we may get errors. For example, in the ‘pace’ column, the values should be numeric, but here and there you will see NaN values. The most generic, though not the most efficient, way is to replace those NaN values with zeros. Here is how to change all the NaN values of the ‘pace’ column to zeros:
df['pace'].fillna(0, inplace=True)

If you notice, the NaN in the pace column is zero now. Any other NaN values in the pace column have been replaced by zeros as well.

As I mentioned before, replacing with zero may not be the most efficient way. You can replace the nulls with some other value of your choice; it is also common to use the mean or median. If we wanted to replace the NaN values of the pace column with the mean of the pace column, we would use this line of code instead:

df['pace'].fillna(df['pace'].mean(), inplace=True)
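In recent pandas versions, assigning the result back is often preferred over inplace=True on a selected column; the equivalent would be:

df['pace'] = df['pace'].fillna(df['pace'].mean())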

21. df.groupby()

This is the most popular function for summarizing data.


You can group the data as per a certain variable and find
out useful information about those groups. For example,
here I am grouping the data by nationality and
calculating the total ‘value_eur’ for each nationality:
df.groupby("nationality")['value_eur'].sum()

Output:

nationality
Albania 25860000
Algeria 70560000
Angola 6070000
Antigua & Barbuda 1450000
Argentina 1281372000
...
Uzbekistan 7495000
Venezuela 41495000
Wales 113340000
Zambia 4375000
Zimbabwe 6000000
Name: value_eur, Length: 149, dtype: int64

The sum of ‘value_eur’ for all the players of Albania is


25860000.

It is also possible to group by several variables and use several aggregate functions. For each nationality and each league rank, we will see the mean value_eur, median value_eur, mean wage_eur, and median wage_eur:

df.groupby(['nationality', 'league_rank'])[['value_eur', 'wage_eur']].agg(['mean', 'median'])

Output:
22. .pct_change()

You can get the percent change from the previous value of a variable. For this demonstration, I will use the value_eur column and get the percent change from the previous row for each row of data. The first row will be NaN because there is no earlier value to compare with.

df.value_eur.pct_change()

Output
0 NaN
1 -0.213930
2 -0.310127
3 -0.036697
4 0.209524
...
16150 0.000000
16151 0.500000
16152 -0.500000
16153 0.000000
16154 -1.000000
Name: value_eur, Length: 16155, dtype: float64

You may not find this very useful in this dataset. But think of financial data, especially when you have a stock’s market value for every day. How nice it would be to see the percent change in its value from day to day.
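A tiny illustration with made-up daily closing prices (hypothetical numbers, not from the FIFA dataset):

prices = pd.Series([100, 102, 99, 105], name="close")
prices.pct_change()
# 0         NaN
# 1    0.020000
# 2   -0.029412
# 3    0.060606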


23. df.count()

It provides the number of non-null entries in the DataFrame along the specified axis. When the axis is 0, it gives the count for each column:

df.count(0)

Output:
Unnamed: 0 16155
sofifa_id 16155
player_url 16155
short_name 16155
long_name 16155
...
goalkeeping_diving 16155
goalkeeping_handling 16155
goalkeeping_kicking 16155
goalkeeping_positioning 16155
goalkeeping_reflexes 16155
Length: 81, dtype: int64

You can see the number of non-null values in each column.

When the axis is 1, it gives the count for each row:

df.count(1)

Output:
0 72
1 72
2 72
3 72
4 71
..
16150 68
16151 68
16152 68
16153 68
16154 69
Length: 16155, dtype: int64

As you can see, the rows do not all have the same number of non-null values. If you observe the dataset carefully, you will see that it has a lot of null values in several columns.

24. df[‘’].value_counts()

We can get the count of each category using this function. Here I am getting how many rows there are for each league_rank:

df['league_rank'].value_counts()

Output:

1.0 11738
2.0 2936
3.0 639
4.0 603
Name: league_rank, dtype: int64
It returns the result sorted in descending order by default. If you want the result in ascending order, simply set ascending=True:

df['league_rank'].value_counts(ascending=True)

Output:

4.0 603
3.0 639
2.0 2936
1.0 11738
Name: league_rank, dtype: int64

25. pd.crosstab()

It gives you a frequency table, that is, a cross-tabulation of two variables. I am making a cross-tabulation of league_rank and international_reputation here:

pd.crosstab(df['league_rank'], df['international_reputation'])

So, we got the counts for all the combinations of league_rank and international_reputation. We can see that the majority of players have both international_reputation and league_rank equal to 1.

It can be improved further. We can add margins in both directions, which give the totals, and we can also get normalized values if necessary:

pd.crosstab(df['league_rank'], df['international_reputation'],
            margins=True,
            margins_name="Total",
            normalize=True)
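normalize also accepts 'index' or 'columns' if you want row-wise or column-wise proportions instead of proportions of the grand total:

pd.crosstab(df['league_rank'], df['international_reputation'], normalize='index')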

26. pd.qcut()

This function bins the data, segmenting it based on the distribution of the values, so we get a value range for each player. Here I am going to segment value_eur into 5 portions and see which player falls into which portion:

pd.qcut(df['value_eur'], q = 5)

Output:
0 (1100000.0, 100500000.0]
1 (1100000.0, 100500000.0]
2 (1100000.0, 100500000.0]
3 (1100000.0, 100500000.0]
4 (1100000.0, 100500000.0]
...
16150 (-0.001, 100000.0]
16151 (-0.001, 100000.0]
16152 (-0.001, 100000.0]
16153 (-0.001, 100000.0]
16154 (-0.001, 100000.0]
Name: value_eur, Length: 16155, dtype: category
Categories (5, interval[float64]): [(-0.001, 100000.0] < ... < (1100000.0, 100500000.0]]

You can use value_counts on the above line of code to see how many players fall into each range:

pd.qcut(df['value_eur'], q = 5).value_counts()

Output:

(-0.001, 100000.0] 3462


(230000.0, 500000.0] 3305
(100000.0, 230000.0] 3184
(500000.0, 1100000.0] 3154
(1100000.0, 100500000.0] 3050
Name: value_eur, dtype: int64

As you can see, the numbers are pretty close. By default, qcut tries to divide the data into bins with roughly equal numbers of observations, but in real life the counts are not always exactly equal because the distribution is rarely uniform.

27. pd.cut()

Another method for binning. If we make 5 bins using cut, it divides the entire value_eur range into five equal-width portions, and the population of each bin follows accordingly.

pd.cut(df['value_eur'], bins = 5).value_counts()

Output:

(-100500.0, 20100000.0] 16102


(20100000.0, 40200000.0] 40
(40200000.0, 60300000.0] 10
(60300000.0, 80400000.0] 2
(80400000.0, 100500000.0] 1
Name: value_eur, dtype: int64

The interval of each range is equal, but the population of each group is very different.

28. df[‘’].describe()

This is a great function that provides some basic


statistical measures. Here I am using the describe
function on the wage_eur column:

df['wage_eur'].describe()

Output:
count 16155.000000
mean 13056.453110
std 23488.182571
min 0.000000
25% 2000.000000
50% 5000.000000
75% 10000.000000
max 550000.000000
Name: wage_eur, dtype: float64

As the output shows, we have eight different measures.


Each of them is very significant.
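describe also works on non-numeric columns; for an object column it reports the count, the number of unique values, the most frequent value, and its frequency:

df['nationality'].describe()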

29. nlargest and nsmallest

This gives you a subset of the dataset with the n largest or n smallest values of a specified variable. As an example, I wanted to get the rows with the top 5 wage_eur:

df.nlargest(5, "wage_eur")

In the same way, I can make a subset of the dataset with the 5 smallest wage_eur values:
df.nsmallest(5, "wage_eur")

30. df.explode()

Explode can be useful when you have lists of data in some rows. It is hard to analyze, visualize, or do any predictive modeling when some cells hold single values and others hold lists. Explode helps break those lists out into separate rows. For example, look at this DataFrame:

df1 = pd.DataFrame({"city": ['A', 'B', 'C'],
                    "day1": [22, 25, 21],
                    'day2': [31, 12, 67],
                    'day3': [27, 20, 15],
                    'day4': [34, 37, [41, 45]],
                    'day5': [23, 54, 36]})
df1

Let’s explode column day4:

df1.explode('day4').reset_index(drop=True)
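The row whose 'day4' cell holds a list gets one row per list element, with the values of the other columns repeated. A minimal sketch of that behavior on a one-row frame:

pd.DataFrame({"city": ["C"], "day4": [[41, 45]]}).explode("day4")
#   city day4
# 0    C   41
# 0    C   45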

Conclusion

Python’s Pandas library is huge, and there are a lot of functions. I chose some of the most important ones for everyday use. If you know these well, you will be able to perform most analysis tasks successfully. Pandas has one more very useful function that I didn’t mention here: the .plot() function. You can plot using pandas alone; pandas uses Matplotlib in the backend and returns the plot for you. I have a detailed tutorial on that here (https://regenerativetoday.com/a-complete-cheat-sheet-for-data-visualization-in-pandas/).

Hopefully, this article was helpful.

Please feel free to follow me on Twitter (https://twitter.com/rashida048), the Facebook page (https://www.facebook.com/Regenerative-149425692134498), and check out my new YouTube channel (https://www.youtube.com/channel/UCzJgOvsJJPCXWytXWuVSeXw).

#DataScience #DataAnalytics #Pandas #Python #DataAnalysis
