Pandas Cheat Sheet
By Ammar Gamal
---------------------------
Pandas Introduction
- What is Pandas?
Pandas is a Python library used for working with data sets.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.
Pandas can clean messy data sets and make them readable and relevant.
Is there a correlation between two or more columns? What is the average value? Max value? Min value? Pandas can also delete rows that are not relevant, or that contain wrong values, like empty or NULL values. This is called cleaning the data.
https://github.com/pandas-dev/pandas
Pandas Getting Started
- Installation of Pandas
If you have Python and PIP already installed on a system, then installation of Pandas is very easy:
pip install pandas
If this command fails, use a Python distribution that already has Pandas installed, like Anaconda or Spyder.
- Import Pandas as pd
Pandas is usually imported under the pd alias:
import pandas as pd
# Check the installed version:
print(pd.__version__)
2.2.2
# List the methods and attributes of the pandas module:
print(dir(pd))
Pandas Series
What is a Series?
A Pandas Series is like a column in a table: a one-dimensional array holding data of any type.
# Example
# Create a Series from a list:
std_mark = [399, 380, 389, 345]
x = pd.Series(std_mark)
print(x)
0 399
1 380
2 389
3 345
dtype: int64
- Labels
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.
# Example
# Return the first value of the Series:
print(x[0])
399
- Create Labels
With the index argument, you can name your own labels.
student = ["Mahmoud","mohamed","mohsen","hamda"]
x = pd.Series(data = std_mark, index = student)
x
Mahmoud 399
mohamed 380
mohsen 389
hamda 345
dtype: int64
# When you have created labels, you can access an item by referring to the label.
print(x["hamda"])
345
You can also create a Series from a key/value object, like a dictionary; the keys become the labels.
# Example
# Create a Series from a dictionary:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
day1 420
day2 380
day3 390
dtype: int64
# To select only some of the items, specify the keys in the index argument:
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
day1 420
day2 380
dtype: int64
Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.
print(pd.DataFrame())
Empty DataFrame
Columns: []
Index: []
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).
# Example
# Return row 0:
data = {
    "std_id": [1204, 1205],
    "std_name": ["Ammar", "Gamal"],
    "std_marks": [400, 405],
    "percentage": [97.56, 98.78]
}
df = pd.DataFrame(data)
print(df.loc[0])
# Note: When using a single label, the result is a Pandas Series.
std_id 1204
std_name Ammar
std_marks 400
percentage 97.56
Name: 0, dtype: object
# Example
# Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
# Note: When using [], the result is a Pandas DataFrame.
std_id std_name std_marks percentage
0 1204 Ammar 400 97.56
1 1205 Gamal 405 98.78
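To see the difference the note points out, compare a single label with a list of labels (a minimal sketch with a hypothetical two-row frame):

```python
import pandas as pd

# Hypothetical two-row frame mirroring the example above
df = pd.DataFrame({
    "std_id": [1204, 1205],
    "std_name": ["Ammar", "Gamal"],
})

row = df.loc[0]        # single label -> returns a pandas Series
rows = df.loc[[0, 1]]  # list of labels -> returns a pandas DataFrame

print(type(row).__name__)   # Series
print(type(rows).__name__)  # DataFrame
```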
# Create a DataFrame with named indexes:
data = {
    # column: row values
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df
calories duration
day1 420 50
day2 380 40
day3 390 45
# Use the named index in the loc attribute to return the specified row(s):
print(df.loc["day2"])
calories 380
duration 40
Name: day2, dtype: int64
Pandas Read CSV
CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.
df = pd.read_csv('D:/Ai Diploma/insurance.csv')
df.head()
# If the number of rows is not specified, the head() method returns the top 5 rows.
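As a quick check of that default: head() with no argument returns the first 5 rows, and tail() the last 5 (sketched on a synthetic frame rather than the CSV above):

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})  # synthetic 10-row frame

print(len(df.head()))   # 5 rows by default
print(len(df.head(3)))  # 3 rows when specified
print(len(df.tail()))   # 5 rows from the end
```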
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1337 non-null float64
1 sex 1337 non-null object
2 bmi 1337 non-null float64
3 children 1338 non-null int64
4 smoker 1337 non-null object
5 region 1338 non-null object
6 charges 1337 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 73.3+ KB
Pandas - Cleaning Data
Data cleaning means fixing bad data in your data set. Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
In this tutorial you will learn how to deal with all of them.
# Count the empty (NULL) cells in each column:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
print(df.isnull().sum())
age 1
sex 1
bmi 2
children 0
smoker 4
region 0
charges 2
dtype: int64
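The per-column counts above are produced by chaining isnull(), which yields a frame of booleans, with sum(). A self-contained sketch on a toy frame (column names chosen to echo the data set):

```python
import pandas as pd

# Toy frame with a few missing values
df = pd.DataFrame({
    "age": [19.0, None, 28.0],
    "bmi": [27.9, 33.7, None],
    "sex": ["female", "male", "male"],
})

null_counts = df.isnull().sum()
print(null_counts["age"])  # 1
print(null_counts["bmi"])  # 1
print(null_counts["sex"])  # 0
```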
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.head(11)
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
# Example
# Return a new DataFrame with no empty cells:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
new_df = df.dropna()
new_df.head(25)
# Note: By default, dropna() returns a new DataFrame and does not change the original.
# If you want to change the original DataFrame, use the inplace = True argument:
df.dropna(inplace=True)
df.head(5)
# Rows 1, 5, 9, 10, 11, 12, 23 and 31 have been removed.
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead. This way you do not have to delete entire rows just because of some empty cells. The fillna() method allows us to replace empty cells with a value.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.head(15)
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast 130.00000
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
5 31.0 female 130.000 0 no southeast 3756.62160
6 46.0 female 33.440 1 no southeast 8240.58960
7 37.0 female 27.740 3 no northwest 7281.50560
8 37.0 male 29.830 2 no northeast 6406.41070
9 60.0 130 25.840 0 no northwest 28923.13692
10 25.0 male 26.220 0 130 northeast 2721.32080
11 62.0 female 26.290 0 130 southeast 27808.72510
12 23.0 male 34.400 0 130 southwest 1826.84300
13 56.0 female 39.820 0 no southeast 11090.71780
14 27.0 male 42.130 0 yes southeast 39611.75770
To only replace empty values for one column, specify the column name for the DataFrame:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.head(20)
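A minimal sketch of what that looks like, on a toy frame: passing a {column: value} dictionary to fillna() replaces empty cells in the named column only (the value 0 here is just an illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "charges": [16884.92, None, 4449.46],
    "age": [19.0, None, 28.0],
})

# Replace empty cells in the "charges" column only:
df.fillna({"charges": 0}, inplace=True)

print(df["charges"].isnull().sum())  # 0 -> filled
print(df["age"].isnull().sum())      # 1 -> untouched
```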
Replace Using Mean, Median, or Mode
A common way to replace empty cells is to calculate the mean, median or mode value of the column. Pandas uses the mean(), median() and mode() methods to calculate the respective values for a specified column.
Mean = the average value (the sum of all values divided by the number of values).
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].mean()
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) # empty cells in "charges" are now the mean
13260.776618008234
Median = the value in the middle, after you have sorted all values ascending.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].median()
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) # empty cells in "charges" are now the median
9382.033
Mode = the value that appears most frequently.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].mode()[0]
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) # empty cells in "charges" are now the mode
1639.5631
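The three statistics behave quite differently on skewed data; a small sketch makes the contrast concrete:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 100])

print(s.mean())     # 21.6 -> pulled up by the outlier 100
print(s.median())   # 2.0  -> middle value after sorting
print(s.mode()[0])  # 2    -> most frequent value
```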
Cleaning Data of Wrong Format
Cells with data of a wrong format can make it difficult, or even impossible, to analyze the data. To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.
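One common way to take the "convert" option is pd.to_numeric (or pd.to_datetime for dates) with errors='coerce': cells that cannot be parsed become NaN, and can then be removed or filled like any other empty cell. A sketch on toy data (the column and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": ["19", "28", "not available", "33"]})

# Convert the column to numbers; unparseable cells become NaN:
df["age"] = pd.to_numeric(df["age"], errors="coerce")
print(df["age"].isnull().sum())  # 1

# Now the bad cell can be handled like any empty cell:
df.dropna(subset=["age"], inplace=True)
print(len(df))  # 3
```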
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
print(df.isnull().sum())
df.head(10)
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
Replacing Values
One way to fix wrong values is to replace them with something else. In our example, a wrong value is most likely a typo, and we could simply insert the correct value in the affected cell, e.g. set the age in row 1 to 20:
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
df.loc[1,"age"] = 20
df.head(2)
For small data sets you might be able to replace the wrong data one by one, but not for big data
sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries
for legal values, and replace any values that are outside of the boundaries.
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
m = int( df["age"].mean())
print(m)
for x in df.index:
    if df.loc[x, "age"] > 90:
        df.loc[x, "age"] = m
df.head(6)
50
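The loop above can also be written as a single boolean-indexing statement, which is the more idiomatic pandas form (toy frame; same 90-year boundary):

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 50, 120, 33, 95]})
m = int(df["age"].mean())  # 63

# Replace every out-of-bounds age in one statement:
df.loc[df["age"] > 90, "age"] = m

print(df["age"].tolist())  # [19, 50, 63, 33, 63]
```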
Removing Rows
Another way of handling wrong data is to remove the rows that contain it.
This way you do not have to find out what to replace the values with, and there is a good chance you do not need those rows for your analysis.
for x in df.index:
    if df.loc[x, "bmi"] > 50:
        df.drop(x, inplace = True)
df.head() # rows with bmi > 50 have been removed
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.9240
1 50 male 33.77 1 no mk 1725.5523
2 28 male 33.00 56 no southeast 4449.4620
4 32 male 28.88 0 5 northwest 5.6551
5 50 female 25.74 0 no kjguki 3756.6216
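Row removal by condition is also usually done with a boolean mask instead of a loop, keeping only the rows that pass the test (toy frame; same bmi boundary):

```python
import pandas as pd

df = pd.DataFrame({"bmi": [27.9, 33.7, 130.0, 28.8]})

# Keep only the rows whose bmi is within bounds:
df = df[df["bmi"] <= 50]

print(len(df))  # 3 rows remain
```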
Discovering Duplicates
Duplicate rows are rows that have been registered more than once.
By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
# Count the duplicated rows:
df.duplicated().sum()
# Remove the duplicated rows:
df.drop_duplicates(inplace = True)
df.duplicated().sum()
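On a toy frame, the two calls look like this: duplicated() flags rows that repeat an earlier row, and drop_duplicates() removes them:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ammar", "Gamal", "Gamal"],
    "mark": [400, 405, 405],
})

print(df.duplicated().sum())  # 1 duplicated row
df.drop_duplicates(inplace=True)
print(df.duplicated().sum())  # 0
```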
Pandas - Data Correlations
The corr() method calculates the relationship between each column in your data set.
# Example
# Show the relationship between the columns:
df=pd.read_csv("D:/Ai Diploma/data.csv")
df.corr() #[-1,1]
• 1 means that there is a 1-to-1 relationship (a perfect correlation): for this data set, each time a value went up in the first column, the other one went up as well.
• 0.9 is also a good relationship: if you increase one value, the other will probably increase as well.
• -0.9 would be just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
• 0.2 means NOT a good relationship: if one value goes up, that does not mean the other will.
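The bullet points can be verified on a tiny frame: a column that is an exact multiple of another correlates at 1.0, and one that moves the opposite way at -1.0 (the column names are made up for the sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "duration": [30, 45, 60],
    "calories": [300, 450, 600],  # exactly 10x duration -> perfect correlation
    "rest": [60, 45, 30],         # moves the opposite way -> perfect negative
})

c = df.corr()
print(round(c.loc["duration", "calories"], 4))  # 1.0
print(round(c.loc["duration", "rest"], 4))      # -1.0
```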
Pandas - Plotting
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library, to visualize the diagram on the screen.
The source code for Matplotlib is located at https://github.com/matplotlib/matplotlib
import matplotlib.pyplot as plt
df = pd.read_csv('D:/Ai Diploma/data.csv')
df.plot()
plt.show()
Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x-axis and a y-axis. In the example below we will use "Duration" for the x-axis and "Calories" for the y-axis:
x = 'Duration', y = 'Calories'
df = pd.read_csv('D:/Ai Diploma/data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Remember: In the previous example, we learned that the correlation between "Duration" and "Calories" was 0.922721, and we concluded that a higher duration means more calories burned.
Let's create another scatter plot, where there is a bad relationship between the columns, like "Duration" and "Maxpulse", with the correlation 0.009403:
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()
Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram shows us the frequency of each interval, e.g. how many workouts lasted between
50 and 60 minutes?
In the example below we will use the "Duration" column to create the histogram:
df["Duration"].plot(kind = 'hist')
<Axes: ylabel='Frequency'>
Conclusion
A Pandas cheat sheet is an invaluable tool for data scientists, analysts, and developers working
with data in Python. By summarizing essential functions and commands, it helps users quickly
recall syntax, methods, and operations, improving productivity and efficiency. Key areas typically
covered in a Pandas cheat sheet include data import/export, DataFrame and Series
manipulation, filtering, sorting, aggregation, and handling missing data. With a cheat sheet on
hand, users can navigate complex data transformation tasks with greater ease, reinforcing their
knowledge of Pandas and streamlining their workflow. Whether you're a beginner or an
experienced user, a Pandas cheat sheet can be a quick reference to master data manipulation
and speed up your data analysis tasks.