Pandas Notes: Basic to Advanced
Pandas Library
MANDAR PATIL
Pandas is the main Python library for data science and analytics. Whether you are building a
machine learning model or just want to take a quick look at your data, you will use it. For
this part, we are going to go over the main concepts of Pandas.
Let's begin
In [1]: # Import
import numpy as np
import pandas as pd
We create a data frame with the read_csv function. This function takes a file path string as an argument. The file path is simply the location of the file on your system. When I work with data, I usually put the data file in the same directory as the Python/Jupyter Notebook file, so that passing just the file name is sufficient. Python requires only the name of a file if the file is in the same directory as the .py or .ipynb file you are running.
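The input cell that produced the table below is missing from this export. A plausible reconstruction (the number of rows displayed is an assumption; the file name is the one used later in these notes):

auto_df = pd.read_csv('auto-mpg.csv')   # read the automobile data into a data frame
auto_df.head(6)                         # show the first six rows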
Out[3]:
    mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin  car name
0  18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0        165    3693          11.5          70       1  buick skylark 320
2  18.0          8         318.0        150    3436          11.0          70       1  plymouth satellite
3  16.0          8         304.0        150    3433          12.0          70       1  amc rebel sst
4  17.0          8         302.0        140    3449          10.5          70       1  ford torino
5  15.0          8         429.0        198    4341          10.0          70       1  ford galaxie 500
This is a Pandas data frame. It has rows and columns. The rows have index numbers next to them (on the left). The row index starts from 0 (like most Python objects). The columns also have index numbers starting from 0, but unlike the row index numbers, they are not displayed; we see the column names instead. We will learn how to do operations with these index numbers.
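The last rows shown below were presumably displayed with .tail(); the original cell is not in the export, so this is only a sketch:

auto_df.index     # RangeIndex(start=0, stop=398, step=1), the row index
auto_df.columns   # the column labels
auto_df.tail(6)   # the last six rows of the data frame (the argument is an assumption)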
      mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin  car name
392  27.0          4         151.0         90    2950          17.3          82       1  chevrolet camaro
393  27.0          4         140.0         86    2790          15.6          82       1  ford mustang gl
394  44.0          4          97.0         52    2130          24.6          82       2  vw pickup
395  32.0          4         135.0         84    2295          11.6          82       1  dodge rampage
396  28.0          4         120.0         79    2625          18.6          82       1  ford ranger
397  31.0          4         119.0         82    2720          19.4          82       1  chevy s-10
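The input cells for Out[6] through Out[8] are also missing; they were presumably along these lines (a sketch, not the author's exact code):

list(auto_df.columns)   # Out[6]: the column names as a list
auto_df.shape           # Out[7]: (rows, columns)
auto_df.size            # Out[8]: total number of values (398 rows x 9 columns = 3582)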
Out[6]: ['mpg',
 'cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'model year',
 'origin',
 'car name']

Out[7]: (398, 9)

Out[8]: 3582
We can use the .info() method to learn some very important things about our data frame, such as:

RangeIndex: the number of entries (rows) and the range of their index numbers.
#: the index number of each column.
Column: the column name.
Non-Null Count: the number of non-null values.
Dtype: the data type of the values held by the column (object usually stands for string).
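The listing below is the output of .info(); the call itself is not visible in the export:

auto_df.info()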
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 398 non-null object
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model year 398 non-null int64
7 origin 398 non-null int64
8 car name 398 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
The horsepower column should hold numerical (integer or float) values, but .info() says it has the object (string) data type. Let's see how we can find out why and solve this issue.
We can call .unique() on the column to check all of its unique values.
In [10]: auto_df.horsepower.unique()
Out[10]: array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
       '170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
       '200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
       '72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
       '112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
       '75', '91', '122', '67', '83', '78', '52', '61', '93', '148',
       '129', '96', '71', '98', '115', '53', '81', '79', '120', '152',
       '102', '108', '68', '58', '149', '89', '63', '48', '66', '139',
       '103', '125', '133', '138', '135', '142', '77', '62', '132', '84',
       '64', '74', '116', '82'], dtype=object)
Looking at these values carefully, we can see that there are entries marked with '?'. Marking an entry like this causes the data type of the whole column to change to string, because a Pandas data frame column can only hold one type of data: if there is a single string value, all other values under the same column are stored as strings too.
I will get into the details on how to solve such problems in the future. For now, let's take a
look at what we can do to easily fix this. We will use the na_values= parameter of the
read_csv function to turn '?' values into NaN (missing) values. NaN values are treated as
unidentified floats by Pandas. This will turn our column to a numeric (float, in this case) data
type. This will allow us to carry out arithmetic operations easily.
In [11]: # Pass na_values as a keyword argument and set its value to '?'
auto_df = pd.read_csv('auto-mpg.csv', na_values='?')
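The listing below again comes from .info(), presumably called right after re-reading the file (the call itself is not shown in the export):

auto_df.info()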
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 392 non-null float64
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model year 398 non-null int64
7 origin 398 non-null int64
8 car name 398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
Now the horsepower column holds numerical (float) data, as it should.
Selecting Data
For this section, we will be working with the concrete dataset.
https://www.kaggle.com/datasets/prathamtripathi/regression-with-neural-networking
It is beneficial to take a look at the data dictionary before starting. Data dictionaries are
documents that explain the meaning of variables in the dataset:
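The cell that loads the concrete data and displays it is missing from the export; a minimal sketch (the file name is an assumption):

concrete_df = pd.read_csv('concrete_data.csv')   # file name is an assumption
concrete_df.head()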
Out[12]: the concrete data frame (columns: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age, Strength)
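To select a single column as a Series we use single brackets; the original cell is not in the export, but it was presumably:

concrete_df['Cement']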
Out[13]:
0       540.0
1       540.0
2       332.5
3       332.5
4       198.6
        ...
1025    276.4
1026    322.2
1027    148.5
1028    159.1
1029    260.9
Name: Cement, Length: 1030, dtype: float64
This shows us the row index numbers and the values. If we want to see the result as a data frame with column names, we can use double brackets like this:
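The cell itself is missing from the export; it was presumably:

concrete_df[['Cement']]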
Out[14]: Cement
0 540.0
1 540.0
2 332.5
3 332.5
4 198.6
... ...
1025 276.4
1026 322.2
1027 148.5
1028 159.1
1029 260.9
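Double brackets also let us select several columns at once. The cell that produced the two-column output below is missing; judging by the values, it was something like this (the second column name is inferred):

concrete_df[['Cement', 'Age']]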
      Cement  Age
0      540.0   28
1      540.0   28
2      332.5  270
3      332.5  365
4      198.6  360
...      ...  ...
1025   276.4   28
1026   322.2   28
1027   148.5   28
1028   159.1   28
1029   260.9   28
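The next cell, also lost in the export, presumably selected the Strength column as a Series:

concrete_df['Strength']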
Out[17]:
0       79.99
1       61.89
2       40.27
3       41.05
4       44.30
        ...
1025    44.28
1026    31.18
1027    23.70
1028    32.77
1029    32.40
Name: Strength, Length: 1030, dtype: float64
In [18]: # Using .loc to access the values of a single column like a data frame
concrete_df.loc[:,['Strength']]
Out[18]: Strength
0 79.99
1 61.89
2 40.27
3 41.05
4 44.30
... ...
1025 44.28
1026 31.18
1027 23.70
1028 32.77
1029 32.40
We do not have to select all rows or columns with .loc. We can specify a range for them. See
the examples below:
In [20]: # Select rows from 0 to 200, select columns from Cement to Fine Aggregate
concrete_df.loc[0:200, "Cement":"Fine Aggregate"]
We use .iloc when we want to use only the index numbers for rows and columns. Just like
with .loc, we can specify a range.
In [21]: # Using .iloc with index numbers (Select the first 100 rows, select the first column)
concrete_df.iloc[0:100, [0]]
Out[21]: Cement
0 540.0
1 540.0
2 332.5
3 332.5
4 198.6
... ...
95 425.0
96 425.0
97 375.0
98 475.0
99 469.0
In [22]: # Using .iloc with index number (This time with multiple columns)
concrete_df.iloc[0:100,[0,2,4]]
Note: .loc and .iloc behave a bit differently with ranges. .loc ranges include both endpoints, while .iloc ranges include the start but exclude the end (just like standard Python slicing).
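A quick illustration of the difference (a sketch, assuming the concrete data loaded above):

concrete_df.loc[0:2, 'Cement']    # label-based: rows 0, 1 and 2 are returned (both ends included)
concrete_df.iloc[0:2, 0]          # position-based: only rows 0 and 1 are returned (end excluded)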
Note: If you use .iloc without specifying column positions after the comma, it will select all columns.
Out[24]: the concrete data frame with all columns selected (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age, Strength)
Sorting Values
The Pandas .sort_values method allows us to sort the rows by a column in a certain order.
For this section we will use the automobile data we used in the first section.
In [25]: # Sort the values by a column, in descending order (ascending=False), ignore the original index
sorted_auto = auto_df.sort_values(by='mpg', ascending=False, ignore_index=True)
sorted_auto.head(5)
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
0  46.6          4          86.0        65.0    2110          17.9          80       3  mazda glc
1  44.6          4          91.0        67.0    1850          13.8          80       3  honda civic 1500 gl
2  44.3          4          90.0        48.0    2085          21.7          80       2  vw rabbit c (diesel)
3  44.0          4          97.0        52.0    2130          24.6          82       2  vw pickup
4  43.4          4          90.0        48.0    2335          23.7          80       2  vw dasher (diesel)
If we set ignore_index to False, the original row index numbers will appear.
In [26]: # Keep the original row index (ignore_index=False)
sorted_auto_orinx = auto_df.sort_values(by='mpg', ascending=False, ignore_index=False)
sorted_auto_orinx.head(5)
      mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
322  46.6          4          86.0        65.0    2110          17.9          80       3  mazda glc
329  44.6          4          91.0        67.0    1850          13.8          80       3  honda civic 1500 gl
325  44.3          4          90.0        48.0    2085          21.7          80       2  vw rabbit c (diesel)
394  44.0          4          97.0        52.0    2130          24.6          82       2  vw pickup
326  43.4          4          90.0        48.0    2335          23.7          80       2  vw dasher (diesel)
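The input cell for the output below is missing; judging by the values, it sorted the cars by weight in descending order with a reset index (this is an assumption):

auto_df.sort_values(by='weight', ascending=False, ignore_index=True).head(5)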
Out[27]:
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
0  13.0          8         400.0       175.0    5140          12.0          71       1  pontiac safari (sw)
1  11.0          8         400.0       150.0    4997          14.0          73       1  chevrolet impala
2  12.0          8         383.0       180.0    4955          11.5          71       1  dodge monaco (sw)
3  12.0          8         429.0       198.0    4952          11.5          73       1  mercury marquis brougham
4  12.0          8         455.0       225.0    4951          11.0          73       1  buick electra 225 custom
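The next output is also missing its input cell; the values suggest sorting by the number of cylinders in ascending order (an assumption, and the first row of the result is not in the export):

auto_df.sort_values(by='cylinders', ascending=True, ignore_index=True).head(5)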
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
1  19.0          3          70.0        97.0    2330          13.5          72       3  mazda rx2 coupe
2  18.0          3          70.0        90.0    2124          13.5          73       3  maxda rx3
3  23.7          3          70.0       100.0    2420          12.5          80       3  mazda rx-7 gs
4  32.0          4          71.0        65.0    1836          21.0          74       3  toyota corolla 1200
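The summary below comes from .describe(); the call itself is not in the export and was presumably:

auto_df.describe()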
Out[29]: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for the numeric columns mpg, cylinders, displacement, horsepower, weight, acceleration and model year
So, what does this tell us? Let's take a look at the horsepower column to understand better.
The count tells us that the information on horsepower has been collected from 392 cars. There are 398 observations in total.
The mean tells us that the average horsepower of the cars is about 104.
The std (standard deviation) shows us how spread out the horsepower values are: on average, a car's horsepower is about 38 above or below the mean.
25% is the first quartile (the 25th percentile). It means that 25% of the cars have 75 horsepower or less.
50% is the second quartile, also known as the median: half of the cars have less horsepower than this value and half have more.
75% is the third quartile (the 75th percentile). Just like the first two, it marks the value below which 75% of the cars fall.
Filtering
We can form filters with comparison and logical operators like ==, !=, <, >, >=, <=, & (AND), | (OR). What these operators do is explained in the part about control flow statements.
There are two main approaches we can use. The first one (and my favorite) is to form a filter and assign it to a variable. Then we can use this filter variable to get a subset of the data frame through slicing. See the example below:
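The cell that defines filter_one is missing from the export, and the original condition cannot be recovered. A purely hypothetical example of the pattern:

# Hypothetical condition, only to illustrate the pattern; the original filter is unknown
filter_one = (concrete_df['Water'] > 180) & (concrete_df['Fly Ash'] == 0)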
In [31]: concrete_df[filter_one]
Out[31]:
     Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  Coarse Aggregate  Fine Aggregate  Age  Strength
...     ...                 ...      ...    ...               ...               ...             ...  ...       ...
798   500.0                 0.0      0.0  200.0               0.0            1125.0           613.0  270     55.16
813   310.0                 0.0      0.0  192.0               0.0             970.0           850.0  180     37.33
814   310.0                 0.0      0.0  192.0               0.0             970.0           850.0  360     38.11
820   525.0                 0.0      0.0  189.0               0.0            1125.0           613.0  270     67.11
823   322.0                 0.0      0.0  203.0               0.0             974.0           800.0  180     29.59

62 rows × 9 columns
Out[32]:
     Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  Coarse Aggregate  Fine Aggregate  Age  Strength
755   540.0                 0.0      0.0  173.0               0.0            1125.0           613.0  180     71.62
756   540.0                 0.0      0.0  173.0               0.0            1125.0           613.0  270     74.17
795   525.0                 0.0      0.0  189.0               0.0            1125.0           613.0  180     61.92
797   500.0                 0.0      0.0  200.0               0.0            1125.0           613.0  180     51.04
798   500.0                 0.0      0.0  200.0               0.0            1125.0           613.0  270     55.16
820   525.0                 0.0      0.0  189.0               0.0            1125.0           613.0  270     67.11
The second approach is to write filters without assigning them to variables. I don't
recommend doing this because they can look too cluttered. Also, the first approach is much
more reproducible.
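A hypothetical inline version of the same idea (not the lost cell that produced the output below):

concrete_df[(concrete_df['Water'] > 180) & (concrete_df['Fly Ash'] == 0)]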
Out[33]: the filtered concrete data frame (columns: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age, Strength)
Grouping and Aggregation
Grouping --- Forming groups from a column's values. For example, we have the 'origin' column in the automobile dataset. Every row tells us whether the origin of the car is 1 (USA), 2 (Europe) or 3 (Asia). There are 398 rows in the dataset, meaning that there are 398 values under the 'origin' column, each representing one of these 3 origins. Here, we can form groups based on the origin of the automobiles. Instead of considering them through individual row values, we can get an overview of all automobiles organized into these 3 groups.
Aggregation --- After grouping our data, we can look at the values of different columns based on the groups we have. For example, we can look at the values of the 'weight' column for each group. To make things more insightful, we can apply an aggregate function to those column values. In the example below, we look at the average weight of each origin group by using the mean aggregate function. Some common aggregate functions are:
Sum
Count
Min
Max
Mean
Median
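The cell that created the groupby object used below is not in the export; given how it is used, it was presumably:

grouped_origin = auto_df.groupby('origin')   # one group per origin value (1, 2, 3)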
In [35]: # The mean of the weight column for each origin group
grouped_origin['weight'].mean()
Out[35]:
origin
1    3361.931727
2    2423.300000
3    2221.227848
Name: weight, dtype: float64
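The input for the next output is missing; the values are consistent with the maximum mpg per origin group (the exact aggregate is an assumption):

grouped_origin['mpg'].max()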
Out[36]:
origin
1    39.0
2    44.3
3    46.6
Name: mpg, dtype: float64
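Likewise, the cell that created grouped_cylinder and the aggregation below are missing; the values look like the maximum acceleration for each cylinder group (the aggregate is an assumption):

grouped_cylinder = auto_df.groupby('cylinders')
grouped_cylinder['acceleration'].max()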
Out[37]:
cylinders
3    13.5
4    24.8
5    20.1
6    21.0
8    22.2
Name: acceleration, dtype: float64
In [38]: # Standard deviation of mpg (miles per gallon) for each cylinder-number group
grouped_cylinder['mpg'].std()
Out[38]:
cylinders
3    2.564501
4    5.710156
5    8.228204
6    3.807322
8    2.836284
Name: mpg, dtype: float64
In [39]: # Such groupby aggregation results can also be accessed like data frames by using double brackets
grouped_cylinder[['mpg']].std()
Out[39]: mpg
cylinders
3 2.564501
4 5.710156
5 8.228204
6 3.807322
8 2.836284
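The input for Out[40] is missing; the values match the mean mpg for each cylinder group (an assumption):

grouped_cylinder[['mpg']].mean()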
Out[40]: mpg
cylinders
3 20.550000
4 29.286765
5 27.366667
6 19.985714
8 14.963107
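The cells behind Out[42] are missing as well; the values look like the median Strength for each Age group of the concrete data (the variable name and the aggregate are assumptions):

grouped_age = concrete_df.groupby('Age')
grouped_age[['Strength']].median()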
Out[42]: Strength
Age
1 9.455
3 15.720
7 21.650
14 26.540
28 33.760
56 51.720
90 39.680
91 67.950
100 46.985
120 39.380
180 40.905
270 51.730
360 41.685
365 42.815
Adding New Columns
The main rule to keep in mind here is that the column we add has to have the same length (number of rows) as the rest of the data frame.
We can fill in the column values manually, or we can use methods and functions to fill the new column with processed or aggregated values. You will most likely go with the second approach, as it is more practical, faster and easier.
For our example, we will add a new column to the concrete dataset, which will show the
strength/cement ratio.
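A minimal sketch of how this can be done (the new column name is an assumption):

concrete_df['Ratio'] = concrete_df['Strength'] / concrete_df['Cement']   # element-wise ratio, same length as the frame
concrete_df.head()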
Out[43]: the concrete data frame with the new ratio column added after Strength
Exercises
You can find the diabetes data here:
https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset
Select the first 30 rows of the second, fourth and the fifth columns of the concrete
dataset
Form a data frame from the mpg, cylinders and the displacement columns of the auto-
mpg dataset
Sort the concrete data by strength in ASCENDING order, select the first 20 rows of
strength and cement columns
Sort the concrete data by age in DESCENDING order, select the first 15 rows of age and
strength columns
Sort the diabetes data by glucose in DESCENDING order, select the first 12 rows of
glucose and bmi columns
Sort the auto dataset by acceleration in ASCENDING order, select the first 15 rows of
acceleration, mpg, displacement and weight columns
Patients with a glucose higher than 120 AND blood pressure higher than or equal to 68
(diabetes data)
Patients with glucose higher than 140 AND bmi lower than 27 (diabetes data)
Concrete with water higher than 200 AND age lower than 300
Automobiles with mpg rate higher than 15 AND cylinder number higher than 6
Automobiles with mpg higher than 20 AND weight lower than 4000 AND acceleration
higher than or equal to 15
Access the minimum strength values for each age group of the concrete dataset.
Access the standard deviation of weight for each origin group of the auto dataset.