30 Very Useful Pandas Functions for Everyday Data Analysis Tasks
rashida048 (https://regenerativetoday.com/author/rashida048/) - January 26, 2022 - Data Science (https://regenerativetoday.com/category/data-science/)
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)  # display up to 100 columns instead of truncating
1. pd.read_csv, pd.read_excel
df = pd.read_csv("fifa.csv")  # load the FIFA dataset
df.head(7)  # show the first seven rows
2. df.columns
df.columns
3. df.drop()
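The drop example itself was not preserved in this copy. A minimal sketch that removes a column (the column choice here is only an illustration; without inplace=True or reassignment, the original df is unchanged):
df.drop(columns=['nation_jersey_number'])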
4. .len()
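The snippet was not preserved either; presumably it used Python's built-in len() to get the number of rows:
len(df)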
Output:
16155
5. df.query()
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)
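df.query() filters rows with a boolean expression written as a string. The original example was lost; a minimal sketch (the condition is an assumption):
df.query("height_cm > 180")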
6. df.iloc()
df.iloc[:10, 5:10]  # first 10 rows, columns at positions 5 through 9
7. df.loc()
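The call was lost in extraction; based on the description below, it presumably selected a list of row labels and a list of columns, along these lines (the column names are assumptions):
df.loc[[3, 10, 14, 23], ['nationality', 'club_name', 'league_name']]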
Look at the row indices: we only have the 3rd, 10th, 14th, and 23rd rows. Likewise, only the specified columns appear.
8. df[‘’].dtypes
df.height_cm.dtypes
Output:
dtype('int64')
You can also get the data type of every column at once with this syntax:
df.dtypes
Output:
height_cm int64
weight_kg int64
nationality object
random_col int32
club_name object
league_name object
league_rank float64
overall int64
potential int64
value_eur int64
wage_eur int64
player_positions object
preferred_foot object
international_reputation int64
skill_moves int64
work_rate object
body_type object
team_position object
team_jersey_number float64
nation_position object
nation_jersey_number float64
pace float64
shooting float64
passing float64
dribbling float64
defending float64
physic float64
cumsum_2 int64
rank_calc float64
dtype: object
9. df.select_dtypes()
df.select_dtypes(include='int64')
We got all the columns that have the data type ‘int64’. If
we use ‘exclude’ instead of ‘include’ in the ‘select_dtypes’
function, we will get the columns that do not have the
data type ‘int64’:
df.select_dtypes(exclude='int64')
Here is part of the output. Notice that none of these columns are integers. You may think the ‘random_col’ column is integers, but while it looks like integers, its data type is actually ‘int32’, as the dtypes listing above shows. Please feel free to check.
10. df.insert()
(https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html)
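The snippet was lost here; given the ‘random_col’ column that shows up later in the dataset, it presumably created a column of random integers and inserted it at a fixed position, something like this (the position 3 is an assumption):
random_col = np.random.randint(100, size=len(df))
df.insert(3, 'random_col', random_col)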
df.head()
11. df[‘’].cumsum()
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html)
df[['value_eur', 'wage_eur']].cumsum()
As you can see, each row holds the cumulative sum of all the values up to and including that row.
12. df.sample()
When the size of the dataset is too big, you can take a representative sample from it to perform the analysis and predictive modeling. That may save you some time. Also, too much data can sometimes clutter a visualization. We can use this function to get a certain number of data points or a certain fraction of the data. Here I am taking a random sample of 200 data points from the FIFA dataset:
df.sample(n = 200)  # 200 random rows
df.sample(frac = 0.25)  # a random 25% of the rows
13. df[‘’].where()
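The original snippet was lost in extraction; judging from the output below, it was presumably something like this (the condition random_col > 50 is an assumption inferred from which values survive):
df['random_col'].where(df['random_col'] > 50)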
Output:
0 NaN
1 NaN
2 56.0
3 NaN
4 NaN
...
16150 65.0
16151 NaN
16152 NaN
16153 57.0
16154 NaN
Name: random_col, Length: 16155, dtype: float64
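Passing a second argument replaces the non-matching values with that value instead of NaN (again a sketch, with the same assumed condition):
df['random_col'].where(df['random_col'] > 50, 0)
Output: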
0 0
1 0
2 56
3 0
4 0
..
16150 65
16151 0
16152 0
16153 57
16154 0
Name: random_col, Length: 16155, dtype: int32
14. df[‘’].unique()
df.skill_moves.unique()
15. df[‘’].nunique()
df.nationality.nunique()
Output:
149
The great thing is, this function can also be used on the entire dataset to get the number of unique values in each column:
df.nunique()
Output:
height_cm 48
weight_kg 54
nationality 149
random_col 100
club_name 577
league_name 37
league_rank 4
overall 53
potential 49
value_eur 161
wage_eur 41
player_positions 907
preferred_foot 2
international_reputation 5
skill_moves 5
work_rate 9
body_type 3
team_position 29
team_jersey_number 99
nation_position 28
nation_jersey_number 26
pace 74
shooting 70
passing 67
dribbling 67
defending 69
physic 63
cumsum_2 14859
rank_calc 161
dtype: int64
16. df[‘’].rank()
df['rank_calc'] = df["value_eur"].rank()
This ranks the rows by ‘value_eur’ and stores the result in the new column ‘rank_calc’.
17. df.isin()
If you run df.isin() with a list of countries (see the sketch below), the resulting dataset contains only those few countries mentioned in the list.
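The original snippet and its exact country list were not preserved in this copy; a minimal sketch with a hypothetical list:
df[df['nationality'].isin(['Argentina', 'Brazil', 'Spain'])]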
18. df.replace()
df.replace(1.0, 1.1)  # replace every 1.0 in the DataFrame with 1.1
19. df.rename()
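The rename example was not preserved; df.rename() relabels columns (or index entries) with a mapping, for example (the mapping here is hypothetical):
df.rename(columns={'league_rank': 'rank'})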
20. df[‘’].fillna()
df['pace'].fillna(df['pace'].mean(), inplace=True)
This fills the null values of the ‘pace’ column with the column's mean.
21. df.groupby()
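The grouping call itself was lost; judging from the output below, it presumably summed ‘value_eur’ per nationality:
df.groupby('nationality')['value_eur'].sum()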
Output:
nationality
Albania 25860000
Algeria 70560000
Angola 6070000
Antigua & Barbuda 1450000
Argentina 1281372000
...
Uzbekistan 7495000
Venezuela 41495000
Wales 113340000
Zambia 4375000
Zimbabwe 6000000
Name: value_eur, Length: 149, dtype: int64
df.groupby(['nationality', 'league_rank'])['value_eur'].sum()
22. .pct_change()
You can get the percent change from the previous value of a variable. For this demonstration, I will use the value_eur column and get the percent change from the previous row for each row of data. The first row is NaN because there is no earlier value to compare against.
df.value_eur.pct_change()
Output:
0 NaN
1 -0.213930
2 -0.310127
3 -0.036697
4 0.209524
...
16150 0.000000
16151 0.500000
16152 -0.500000
16153 0.000000
16154 -1.000000
Name: value_eur, Length: 16155, dtype: float64
23. df.count()
df.count() returns the number of non-null values along the given axis. With axis 0, it counts the non-null values in each column:
df.count(0)
Output:
Unnamed: 0 16155
sofifa_id 16155
player_url 16155
short_name 16155
long_name 16155
...
goalkeeping_diving 16155
goalkeeping_handling 16155
goalkeeping_kicking 16155
goalkeeping_positioning 16155
goalkeeping_reflexes 16155
Length: 81, dtype: int64
With axis 1, it counts the non-null values in each row:
df.count(1)
Output:
0 72
1 72
2 72
3 72
4 71
..
16150 68
16151 68
16152 68
16153 68
16154 69
Length: 16155, dtype: int64
As you can see, the rows do not all have the same number of non-null values. If you observe the dataset carefully, you will see that several columns contain a lot of null values.
24. df[‘’].value_counts()
df['league_rank'].value_counts()
Output:
1.0 11738
2.0 2936
3.0 639
4.0 603
Name: league_rank, dtype: int64
It returns the result sorted by default. If you want the
result in ascending order, simply set ascending=True:
df['league_rank'].value_counts(ascending=True)
Output:
4.0 603
3.0 639
2.0 2936
1.0 11738
Name: league_rank, dtype: int64
25. pd.crosstab()
pd.crosstab(df['league_rank'], df['international_reputation'])
pd.crosstab(df['league_rank'], df['international_reputation'],
            margins = True,
            margins_name = "Total",
            normalize = True)
26. pd.qcut()
pd.qcut() splits a variable into a given number of equal-sized, quantile-based bins. Here we split ‘value_eur’ into five bins:
pd.qcut(df['value_eur'], q = 5)
Output:
0 (1100000.0, 100500000.0]
1 (1100000.0, 100500000.0]
2 (1100000.0, 100500000.0]
3 (1100000.0, 100500000.0]
4 (1100000.0, 100500000.0]
...
16150 (-0.001, 100000.0]
16151 (-0.001, 100000.0]
16152 (-0.001, 100000.0]
16153 (-0.001, 100000.0]
16154 (-0.001, 100000.0]
Name: value_eur, Length: 16155, dtype: category
Categories (5, interval[float64]): [(-0.001, 100000.0] < ... < (1100000.0, 100500000.0]]
pd.qcut(df['value_eur'], q = 5).value_counts()
27. pd.cut()
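The pd.cut() example and its output were lost in the page extraction. Unlike qcut(), which bins by quantiles, cut() bins by value ranges; a minimal sketch that splits ‘value_eur’ into five equal-width bins:
pd.cut(df['value_eur'], bins=5)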
28. df[‘’].describe()
df['wage_eur'].describe()
Output:
count 16155.000000
mean 13056.453110
std 23488.182571
min 0.000000
25% 2000.000000
50% 5000.000000
75% 10000.000000
max 550000.000000
Name: wage_eur, dtype: float64
29. df.nlargest()
df.nlargest(5, "wage_eur")
This returns the five rows with the largest values of ‘wage_eur’.
30. df.explode()
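The explode example was also lost. df.explode() expands each element of a list-like column into its own row; a toy sketch with hypothetical data:
tmp = pd.DataFrame({'player': ['a', 'b'], 'positions': [['ST', 'CF'], ['GK']]})
tmp.explode('positions')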
Conclusion
These are the pandas functions I find myself reaching for in everyday data analysis tasks. Hopefully some of them become part of your regular toolkit as well.