Pandas Cheat Sheet
By Ammar Gamal
---------------------------
Pandas Introduction
- What is Pandas?
Pandas is a Python library used for working with data sets.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.
Pandas can clean messy data sets and make them readable and relevant.
Is there a correlation between two or more columns? What is the average value? Max value? Min value? Pandas can also delete rows that are not relevant, or that contain wrong values, like empty or NULL values. This is called cleaning the data.
https://github.com/pandas-dev/pandas
Pandas Getting Started
- Installation of Pandas
If you have Python and PIP already installed on a system, then installation of Pandas is very easy:
pip install pandas
If this command fails, use a Python distribution that already has Pandas installed, like Anaconda or Spyder.
- Import Pandas as pd
Pandas is usually imported under the pd alias:
import pandas as pd
# Check the installed version:
print(pd.__version__)
2.2.2
# List the methods and attributes of the pandas module:
print(dir(pd))
Pandas Series
What is a Series?
A Pandas Series is like a column in a table: a one-dimensional array holding data of any type.
# Example
# Create a Series from a list:
std_mark = [399, 380, 389, 345]
x = pd.Series(std_mark)
print(x)
0 399
1 380
2 389
3 345
dtype: int64
- Labels
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.
# Example
# Return the first value of the Series:
print(x[0])
399
- Create Labels
With the index argument, you can name your own labels.
student = ["Mahmoud","mohamed","mohsen","hamda"]
x = pd.Series(data = std_mark, index = student)
x
Mahmoud 399
mohamed 380
mohsen 389
hamda 345
dtype: int64
# When you have created labels, you can access an item by referring to the label.
print(x["hamda"])
345
You can also create a Series from a key/value object, like a dictionary; the keys become the labels.
# Example
# Create a Series from a dictionary:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
day1 420
day2 380
day3 390
dtype: int64
# To select only some of the items, specify the keys in the index argument:
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
day1 420
day2 380
dtype: int64
Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.
print(pd.DataFrame())
Empty DataFrame
Columns: []
Index: []
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).
# Example
# Return row 0:
data = {
    "std_id": [1204, 1205],
    "std_name": ["Ammar", "Gamal"],
    "std_marks": [400, 405],
    "percentage": [97.56, 98.78]
}
df = pd.DataFrame(data)
print(df.loc[0])
# Note: When using a single label, the result is a Pandas Series.
std_id 1204
std_name Ammar
std_marks 400
percentage 97.56
Name: 0, dtype: object
# Example
# Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
# Note: When using [], the result is a Pandas DataFrame.
std_id std_name std_marks percentage
0 1204 Ammar 400 97.56
1 1205 Gamal 405 98.78
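To see the difference the note points out, compare a single label with a list of labels (a minimal sketch with a hypothetical two-row frame):

```python
import pandas as pd

# Hypothetical two-row frame mirroring the example above
df = pd.DataFrame({
    "std_id": [1204, 1205],
    "std_name": ["Ammar", "Gamal"],
})

row = df.loc[0]        # single label -> returns a pandas Series
rows = df.loc[[0, 1]]  # list of labels -> returns a pandas DataFrame

print(type(row).__name__)   # Series
print(type(rows).__name__)  # DataFrame
```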
# Create a DataFrame with named indexes:
data = {
    # column: row values
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df
calories duration
day1 420 50
day2 380 40
day3 390 45
# Use the named index in the loc attribute to return the specified row(s):
print(df.loc["day2"])
calories 380
duration 40
Name: day2, dtype: int64
Pandas Read CSV
CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.
df = pd.read_csv('D:/Ai Diploma/insurance.csv')
df.head()
# If the number of rows is not specified, the head() method returns the top 5 rows.
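As a quick check of that default: head() with no argument returns the first 5 rows, and tail() the last 5 (sketched on a synthetic frame rather than the CSV above):

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})  # synthetic 10-row frame

print(len(df.head()))   # 5 rows by default
print(len(df.head(3)))  # 3 rows when specified
print(len(df.tail()))   # 5 rows from the end
```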
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1337 non-null float64
1 sex 1337 non-null object
2 bmi 1337 non-null float64
3 children 1338 non-null int64
4 smoker 1337 non-null object
5 region 1338 non-null object
6 charges 1337 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 73.3+ KB
Pandas - Cleaning Data
Data cleaning means fixing bad data in your data set. Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
In this tutorial you will learn how to deal with all of them.
# Count the empty (NULL) cells in each column:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
print(df.isnull().sum())
age 1
sex 1
bmi 2
children 0
smoker 4
region 0
charges 2
dtype: int64
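The per-column counts above are produced by chaining isnull(), which yields a frame of booleans, with sum(). A self-contained sketch on a toy frame (column names chosen to echo the data set):

```python
import pandas as pd

# Toy frame with a few missing values
df = pd.DataFrame({
    "age": [19.0, None, 28.0],
    "bmi": [27.9, 33.7, None],
    "sex": ["female", "male", "male"],
})

null_counts = df.isnull().sum()
print(null_counts["age"])  # 1
print(null_counts["bmi"])  # 1
print(null_counts["sex"])  # 0
```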
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.head(11)
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows will not have a big
impact on the result.
# Example
# Return a new DataFrame with no empty cells:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
new_df = df.dropna()
new_df.head(25)
# Note: By default, dropna() returns a new DataFrame and does not change the original.
# If you want to change the original DataFrame, use the inplace = True argument:
df.dropna(inplace=True)
df.head(5)
# Rows 1, 5, 9, 10, 11, 12, 23 and 31 have been removed.
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead. This way you do not have to delete entire rows just because of some empty cells. The fillna() method allows us to replace empty cells with a value.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.head(15)
age sex bmi children smoker region charges
0 19.0 female 27.900 0 yes southwest 16884.92400
1 18.0 male 33.770 1 no southeast 130.00000
2 28.0 male 33.000 3 no southeast 4449.46200
3 33.0 male 22.705 0 no northwest 21984.47061
4 32.0 male 28.880 0 no northwest 3866.85520
5 31.0 female 130.000 0 no southeast 3756.62160
6 46.0 female 33.440 1 no southeast 8240.58960
7 37.0 female 27.740 3 no northwest 7281.50560
8 37.0 male 29.830 2 no northeast 6406.41070
9 60.0 130 25.840 0 no northwest 28923.13692
10 25.0 male 26.220 0 130 northeast 2721.32080
11 62.0 female 26.290 0 130 southeast 27808.72510
12 23.0 male 34.400 0 130 southwest 1826.84300
13 56.0 female 39.820 0 no southeast 11090.71780
14 27.0 male 42.130 0 yes southeast 39611.75770
To only replace empty values for one column, specify the column name for the DataFrame:
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
df.head(20)
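A minimal sketch of what that looks like, on a toy frame: passing a {column: value} dictionary to fillna() replaces empty cells in the named column only (the value 0 here is just an illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "charges": [16884.92, None, 4449.46],
    "age": [19.0, None, 28.0],
})

# Replace empty cells in the "charges" column only:
df.fillna({"charges": 0}, inplace=True)

print(df["charges"].isnull().sum())  # 0 -> filled
print(df["age"].isnull().sum())      # 1 -> untouched
```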
Replace Using Mean, Median, or Mode
A common way to replace empty cells is to calculate the mean, median or mode value of the column. Pandas uses the mean(), median() and mode() methods to calculate the respective values for a specified column.
Mean = the average value (the sum of all values divided by the number of values).
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].mean()
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) # empty cells in "charges" are now the mean
13260.776618008234
Median = the value in the middle, after you have sorted all values ascending.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].median()
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) # empty cells in "charges" are now the median
9382.033
Mode = the value that appears most frequently.
df = pd.read_csv('D:/Ai Diploma/insurance_with_null.csv')
x = df["charges"].mode()[0]
print(x)
df.fillna( {"charges" : x}, inplace = True)
df.head(5) # empty cells in "charges" are now the mode
1639.5631
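The three statistics behave quite differently on skewed data; a small sketch makes the contrast concrete:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 100])

print(s.mean())     # 21.6 -> pulled up by the outlier 100
print(s.median())   # 2.0  -> middle value after sorting
print(s.mode()[0])  # 2    -> most frequent value
```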
Cleaning Data of Wrong Format
Cells with data of a wrong format can make it difficult, or even impossible, to analyze the data. To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.
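One common way to take the "convert" option is pd.to_numeric (or pd.to_datetime for dates) with errors='coerce': cells that cannot be parsed become NaN, and can then be removed or filled like any other empty cell. A sketch on toy data (the column and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"age": ["19", "28", "not available", "33"]})

# Convert the column to numbers; unparseable cells become NaN:
df["age"] = pd.to_numeric(df["age"], errors="coerce")
print(df["age"].isnull().sum())  # 1

# Now the bad cell can be handled like any empty cell:
df.dropna(subset=["age"], inplace=True)
print(len(df))  # 3
```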
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
print(df.isnull().sum())
df.head(10)
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
Replacing Values
One way to fix wrong values is to replace them with something else. In our example, a wrong value is most likely a typo, and we could simply insert the correct value in the affected cell, e.g. set the age in row 1 to 20:
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
df.loc[1,"age"] = 20
df.head(2)
For small data sets you might be able to replace the wrong data one by one, but not for big data
sets.
To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries
for legal values, and replace any values that are outside of the boundaries.
df = pd.read_csv('D:/Ai Diploma/insurance_with_wrong_fotmate.csv')
m = int( df["age"].mean())
print(m)
for x in df.index:
    if df.loc[x, "age"] > 90:
        df.loc[x, "age"] = m
df.head(6)
50
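The loop above can also be written as a single boolean-indexing statement, which is the more idiomatic pandas form (toy frame; same 90-year boundary):

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 50, 120, 33, 95]})
m = int(df["age"].mean())  # 63

# Replace every out-of-bounds age in one statement:
df.loc[df["age"] > 90, "age"] = m

print(df["age"].tolist())  # [19, 50, 63, 33, 63]
```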
Removing Rows
Another way of handling wrong data is to remove the rows that contain it.
This way you do not have to find out what to replace the values with, and there is a good chance you do not need those rows for your analysis.
for x in df.index:
    if df.loc[x, "bmi"] > 50:
        df.drop(x, inplace = True)
df.head() # rows with bmi > 50 have been removed
age sex bmi children smoker region charges
0 19 female 27.90 0 yes southwest 16884.9240
1 50 male 33.77 1 no mk 1725.5523
2 28 male 33.00 56 no southeast 4449.4620
4 32 male 28.88 0 5 northwest 5.6551
5 50 female 25.74 0 no kjguki 3756.6216
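Row removal by condition is also usually done with a boolean mask instead of a loop, keeping only the rows that pass the test (toy frame; same bmi boundary):

```python
import pandas as pd

df = pd.DataFrame({"bmi": [27.9, 33.7, 130.0, 28.8]})

# Keep only the rows whose bmi is within bounds:
df = df[df["bmi"] <= 50]

print(len(df))  # 3 rows remain
```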
Discovering Duplicates
Duplicate rows are rows that have been registered more than once.
By taking a look at our test data set, we can assume that rows 11 and 12 are duplicates.
# Count the duplicated rows:
df.duplicated().sum()
# Remove the duplicated rows:
df.drop_duplicates(inplace = True)
df.duplicated().sum()
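On a toy frame, the two calls look like this: duplicated() flags rows that repeat an earlier row, and drop_duplicates() removes them:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ammar", "Gamal", "Gamal"],
    "mark": [400, 405, 405],
})

print(df.duplicated().sum())  # 1 duplicated row
df.drop_duplicates(inplace=True)
print(df.duplicated().sum())  # 0
```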
Pandas - Data Correlations
The corr() method calculates the relationship between each column in your data set.
# Example
# Show the relationship between the columns:
df=pd.read_csv("D:/Ai Diploma/data.csv")
df.corr() #[-1,1]
• 1 means that there is a 1-to-1 relationship (a perfect correlation): for this data set, each time a value went up in the first column, the other one went up as well.
• 0.9 is also a good relationship: if you increase one value, the other will probably increase as well.
• -0.9 would be just as good a relationship as 0.9, but if you increase one value, the other will probably go down.
• 0.2 means NOT a good relationship: if one value goes up, that does not mean the other will.
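The bullet points can be verified on a tiny frame: a column that is an exact multiple of another correlates at 1.0, and one that moves the opposite way at -1.0 (the column names are made up for the sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "duration": [30, 45, 60],
    "calories": [300, 450, 600],  # exactly 10x duration -> perfect correlation
    "rest": [60, 45, 30],         # moves the opposite way -> perfect negative
})

c = df.corr()
print(round(c.loc["duration", "calories"], 4))  # 1.0
print(round(c.loc["duration", "rest"], 4))      # -1.0
```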
Pandas - Plotting
Pandas uses the plot() method to create diagrams.
We can use Pyplot, a submodule of the Matplotlib library, to visualize the diagram on the screen.
The source code for Matplotlib is located at https://github.com/matplotlib/matplotlib
import matplotlib.pyplot as plt
df = pd.read_csv('D:/Ai Diploma/data.csv')
df.plot()
plt.show()
Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
A scatter plot needs an x-axis and a y-axis. In the example below we will use "Duration" for the x-axis and "Calories" for the y-axis:
x = 'Duration', y = 'Calories'
df = pd.read_csv('D:/Ai Diploma/data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Remember: In the previous example, we learned that the correlation between "Duration" and "Calories" was 0.922721, and we concluded that a higher duration means more calories burned.
Let's create another scatter plot, where there is a bad relationship between the columns, like "Duration" and "Maxpulse", with the correlation 0.009403:
df.plot(kind = 'scatter', x = 'Duration', y = 'Maxpulse')
plt.show()
Histogram
Use the kind argument to specify that you want a histogram:
kind = 'hist'
A histogram shows us the frequency of each interval, e.g. how many workouts lasted between
50 and 60 minutes?
In the example below we will use the "Duration" column to create the histogram:
df["Duration"].plot(kind = 'hist')
<Axes: ylabel='Frequency'>
Conclusion
A Pandas cheat sheet is an invaluable tool for data scientists, analysts, and developers working
with data in Python. By summarizing essential functions and commands, it helps users quickly
recall syntax, methods, and operations, improving productivity and efficiency. Key areas typically
covered in a Pandas cheat sheet include data import/export, DataFrame and Series
manipulation, filtering, sorting, aggregation, and handling missing data. With a cheat sheet on
hand, users can navigate complex data transformation tasks with greater ease, reinforcing their
knowledge of Pandas and streamlining their workflow. Whether you're a beginner or an
experienced user, a Pandas cheat sheet can be a quick reference to master data manipulation
and speed up your data analysis tasks.