Pandas Notes: Basic to Advanced
Pandas Library
MANDAR PATIL
Pandas is the main Python library for data science and analytics. Whether you are building a
machine learning model or just want to take a quick look at your data, you will use it. For
this part, we are going to go over the main concepts of Pandas.
Let's begin
In [1]: # Import
import numpy as np
import pandas as pd
We create a data frame with the read_csv function. This function takes a file path string as an argument. The file path is simply the location of the file on your system. When I work with data, I usually put the data file in the same directory as the Python/Jupyter Notebook file, so that passing just the file name is sufficient. Python requires only the name of a file if the file is in the same directory as the .py or .ipynb file you are running.
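The input cell that produced the table below is missing from this export. A plausible reconstruction (the number of rows displayed is an assumption; the file name is the one used later in these notes):

auto_df = pd.read_csv('auto-mpg.csv')   # read the automobile data into a data frame
auto_df.head(6)                         # show the first six rows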
Out[3]:
    mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin  car name
0  18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1  15.0          8         350.0        165    3693          11.5          70       1  buick skylark 320
2  18.0          8         318.0        150    3436          11.0          70       1  plymouth satellite
3  16.0          8         304.0        150    3433          12.0          70       1  amc rebel sst
4  17.0          8         302.0        140    3449          10.5          70       1  ford torino
5  15.0          8         429.0        198    4341          10.0          70       1  ford galaxie 500
This is a Pandas data frame. It has rows and columns. The rows have index numbers next to them (on the left). The row index starts from 0 (like most Python objects). The columns also have index numbers starting from 0, but unlike the row index numbers, they are not displayed; we see the column names instead. We will learn how to do operations with these index numbers.
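The last rows shown below were presumably displayed with .tail(); the original cell is not in the export, so this is only a sketch:

auto_df.index     # RangeIndex(start=0, stop=398, step=1), the row index
auto_df.columns   # the column labels
auto_df.tail(6)   # the last six rows of the data frame (the argument is an assumption)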
      mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin  car name
392  27.0          4         151.0         90    2950          17.3          82       1  chevrolet camaro
393  27.0          4         140.0         86    2790          15.6          82       1  ford mustang gl
394  44.0          4          97.0         52    2130          24.6          82       2  vw pickup
395  32.0          4         135.0         84    2295          11.6          82       1  dodge rampage
396  28.0          4         120.0         79    2625          18.6          82       1  ford ranger
397  31.0          4         119.0         82    2720          19.4          82       1  chevy s-10
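The input cells for Out[6] through Out[8] are also missing; they were presumably along these lines (a sketch, not the author's exact code):

list(auto_df.columns)   # Out[6]: the column names as a list
auto_df.shape           # Out[7]: (rows, columns)
auto_df.size            # Out[8]: total number of values (398 rows x 9 columns = 3582)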
Out[6]: ['mpg',
 'cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'model year',
 'origin',
 'car name']

Out[7]: (398, 9)

Out[8]: 3582
We can use the .info() method to learn some very important things about our data frame, such as:

RangeIndex: the number of entries (rows) and the range of their index numbers.
#: the index number of each column.
Column: the column name.
Non-Null Count: the number of non-null values.
Dtype: the data type of the values held by the column (object usually stands for string).
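The listing below is the output of .info(); the call itself is not visible in the export:

auto_df.info()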
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 398 non-null object
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model year 398 non-null int64
7 origin 398 non-null int64
8 car name 398 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
The horsepower column should hold numerical (integer or float) values, but .info() says it has the object (string) data type. Let's see how we can find out why and solve this issue.
We can call .unique() on the column to check all of its unique values.
In [10]: auto_df.horsepower.unique()
Out[10]: array(['130', '165', '150', '140', '198', '220', '215', '225', '190',
       '170', '160', '95', '97', '85', '88', '46', '87', '90', '113',
       '200', '210', '193', '?', '100', '105', '175', '153', '180', '110',
       '72', '86', '70', '76', '65', '69', '60', '80', '54', '208', '155',
       '112', '92', '145', '137', '158', '167', '94', '107', '230', '49',
       '75', '91', '122', '67', '83', '78', '52', '61', '93', '148',
       '129', '96', '71', '98', '115', '53', '81', '79', '120', '152',
       '102', '108', '68', '58', '149', '89', '63', '48', '66', '139',
       '103', '125', '133', '138', '135', '142', '77', '62', '132', '84',
       '64', '74', '116', '82'], dtype=object)
Looking at these values carefully, we can see that there are entries marked with '?'. Marking an entry like this causes the data type of the whole column to change to string, because a Pandas data frame column can only hold one type of data: if there is a single string value, all other values under the same column are stored as strings too.
I will get into the details on how to solve such problems in the future. For now, let's take a
look at what we can do to easily fix this. We will use the na_values= parameter of the
read_csv function to turn '?' values into NaN (missing) values. NaN values are treated as
unidentified floats by Pandas. This will turn our column to a numeric (float, in this case) data
type. This will allow us to carry out arithmetic operations easily.
In [11]: # Pass na_values as a keyword argument and set its value to '?'
auto_df = pd.read_csv('auto-mpg.csv', na_values='?')
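The listing below again comes from .info(), presumably called right after re-reading the file (the call itself is not shown in the export):

auto_df.info()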
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 392 non-null float64
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model year 398 non-null int64
7 origin 398 non-null int64
8 car name 398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
Now the horsepower column holds numerical (float) data, as it should.
Selecting Data
For this section, we will be working with the concrete dataset.
https://www.kaggle.com/datasets/prathamtripathi/regression-with-neural-networking
It is beneficial to take a look at the data dictionary before starting. Data dictionaries are
documents that explain the meaning of variables in the dataset:
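The cell that loads the concrete data and displays it is missing from the export; a minimal sketch (the file name is an assumption):

concrete_df = pd.read_csv('concrete_data.csv')   # file name is an assumption
concrete_df.head()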
Out[12]: the concrete data frame (columns: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age, Strength)
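To select a single column as a Series we use single brackets; the original cell is not in the export, but it was presumably:

concrete_df['Cement']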
Out[13]:
0       540.0
1       540.0
2       332.5
3       332.5
4       198.6
        ...
1025    276.4
1026    322.2
1027    148.5
1028    159.1
1029    260.9
Name: Cement, Length: 1030, dtype: float64
This shows us the row index numbers and the values. If we want to see the result as a data frame with column names, we can use double brackets like this:
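The cell itself is missing from the export; it was presumably:

concrete_df[['Cement']]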
Out[14]: Cement
0 540.0
1 540.0
2 332.5
3 332.5
4 198.6
... ...
1025 276.4
1026 322.2
1027 148.5
1028 159.1
1029 260.9
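Double brackets also let us select several columns at once. The cell that produced the two-column output below is missing; judging by the values, it was something like this (the second column name is inferred):

concrete_df[['Cement', 'Age']]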
      Cement  Age
0      540.0   28
1      540.0   28
2      332.5  270
3      332.5  365
4      198.6  360
...      ...  ...
1025   276.4   28
1026   322.2   28
1027   148.5   28
1028   159.1   28
1029   260.9   28
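The next cell, also lost in the export, presumably selected the Strength column as a Series:

concrete_df['Strength']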
Out[17]:
0       79.99
1       61.89
2       40.27
3       41.05
4       44.30
        ...
1025    44.28
1026    31.18
1027    23.70
1028    32.77
1029    32.40
Name: Strength, Length: 1030, dtype: float64
In [18]: # Using .loc to access the values of a single column like a data frame
concrete_df.loc[:,['Strength']]
Out[18]: Strength
0 79.99
1 61.89
2 40.27
3 41.05
4 44.30
... ...
1025 44.28
1026 31.18
1027 23.70
1028 32.77
1029 32.40
We do not have to select all rows or columns with .loc. We can specify a range for them. See
the examples below:
In [20]: # Select rows from 0 to 200, select columns from Cement to Fine Aggregate
concrete_df.loc[0:200, "Cement":"Fine Aggregate"]
We use .iloc when we want to use only the index numbers for rows and columns. Just like
with .loc, we can specify a range.
In [21]: # Using .iloc with index numbers (Select the first 100 rows, select the first column)
concrete_df.iloc[0:100, [0]]
Out[21]: Cement
0 540.0
1 540.0
2 332.5
3 332.5
4 198.6
... ...
95 425.0
96 425.0
97 375.0
98 475.0
99 469.0
In [22]: # Using .iloc with index number (This time with multiple columns)
concrete_df.iloc[0:100,[0,2,4]]
Note: .loc and .iloc behave a bit differently with ranges. .loc ranges include both endpoints, while .iloc ranges include the start but exclude the end (just like standard Python slicing).
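A quick illustration of the difference (a sketch, assuming the concrete data loaded above):

concrete_df.loc[0:2, 'Cement']    # label-based: rows 0, 1 and 2 are returned (both ends included)
concrete_df.iloc[0:2, 0]          # position-based: only rows 0 and 1 are returned (end excluded)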
Note: If you use .iloc without specifying column positions after the comma, it will select all columns.
Out[24]: the concrete data frame with all columns selected (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age, Strength)
Sorting Values
The Pandas .sort_values method allows us to sort the rows by a column in a certain order.
For this section we will use the automobile data we used in the first section.
In [25]: # Sort the values by a column, in descending order (ascending=False), ignore the original index
sorted_auto = auto_df.sort_values(by='mpg', ascending=False, ignore_index=True)
sorted_auto.head(5)
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
0  46.6          4          86.0        65.0    2110          17.9          80       3  mazda glc
1  44.6          4          91.0        67.0    1850          13.8          80       3  honda civic 1500 gl
2  44.3          4          90.0        48.0    2085          21.7          80       2  vw rabbit c (diesel)
3  44.0          4          97.0        52.0    2130          24.6          82       2  vw pickup
4  43.4          4          90.0        48.0    2335          23.7          80       2  vw dasher (diesel)
If we set ignore_index to False, the original row index numbers will appear.
In [26]: # Keep the original row index (ignore_index=False)
sorted_auto_orinx = auto_df.sort_values(by='mpg', ascending=False, ignore_index=False)
sorted_auto_orinx.head(5)
      mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
322  46.6          4          86.0        65.0    2110          17.9          80       3  mazda glc
329  44.6          4          91.0        67.0    1850          13.8          80       3  honda civic 1500 gl
325  44.3          4          90.0        48.0    2085          21.7          80       2  vw rabbit c (diesel)
394  44.0          4          97.0        52.0    2130          24.6          82       2  vw pickup
326  43.4          4          90.0        48.0    2335          23.7          80       2  vw dasher (diesel)
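The input cell for the output below is missing; judging by the values, it sorted the cars by weight in descending order with a reset index (this is an assumption):

auto_df.sort_values(by='weight', ascending=False, ignore_index=True).head(5)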
Out[27]:
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
0  13.0          8         400.0       175.0    5140          12.0          71       1  pontiac safari (sw)
1  11.0          8         400.0       150.0    4997          14.0          73       1  chevrolet impala
2  12.0          8         383.0       180.0    4955          11.5          71       1  dodge monaco (sw)
3  12.0          8         429.0       198.0    4952          11.5          73       1  mercury marquis brougham
4  12.0          8         455.0       225.0    4951          11.0          73       1  buick electra 225 custom
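The next output is also missing its input cell; the values suggest sorting by the number of cylinders in ascending order (an assumption, and the first row of the result is not in the export):

auto_df.sort_values(by='cylinders', ascending=True, ignore_index=True).head(5)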
    mpg  cylinders  displacement  horsepower  weight  acceleration  model year  origin  car name
1  19.0          3          70.0        97.0    2330          13.5          72       3  mazda rx2 coupe
2  18.0          3          70.0        90.0    2124          13.5          73       3  maxda rx3
3  23.7          3          70.0       100.0    2420          12.5          80       3  mazda rx-7 gs
4  32.0          4          71.0        65.0    1836          21.0          74       3  toyota corolla 1200
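The summary below comes from .describe(); the call itself is not in the export and was presumably:

auto_df.describe()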
Out[29]: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for the numeric columns mpg, cylinders, displacement, horsepower, weight, acceleration and model year
So, what does this tell us? Let's take a look at the horsepower column to understand better.
The count tells us that the information on horsepower has been collected from 392 cars. There are 398 observations in total.
The mean tells us that the average horsepower of the cars is about 104.
The std (standard deviation) shows us how spread out the horsepower values are: on average, a car's horsepower is about 38 above or below the mean.
25% is the first quartile (the 25th percentile). It means that 25% of the cars have 75 horsepower or less.
50% is the second quartile, also known as the median: half of the cars have less horsepower than this value and half have more.
75% is the third quartile (the 75th percentile). Just like the first two, it marks the value below which 75% of the cars fall.
Filtering
We can form filters with comparison and logical operators like ==, !=, <, >, >=, <=, & (AND), | (OR). What these operators do is explained in the part about control flow statements.
There are two main approaches we can use. The first one (and my favorite) is to form a filter and assign it to a variable. Then we can use this filter variable to get a subset of the data frame through slicing. See the example below:
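The cell that defines filter_one is missing from the export, and the original condition cannot be recovered. A purely hypothetical example of the pattern:

# Hypothetical condition, only to illustrate the pattern; the original filter is unknown
filter_one = (concrete_df['Water'] > 180) & (concrete_df['Fly Ash'] == 0)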
In [31]: concrete_df[filter_one]
Out[31]:
     Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  Coarse Aggregate  Fine Aggregate  Age  Strength
...     ...                 ...      ...    ...               ...               ...             ...  ...       ...
798   500.0                 0.0      0.0  200.0               0.0            1125.0           613.0  270     55.16
813   310.0                 0.0      0.0  192.0               0.0             970.0           850.0  180     37.33
814   310.0                 0.0      0.0  192.0               0.0             970.0           850.0  360     38.11
820   525.0                 0.0      0.0  189.0               0.0            1125.0           613.0  270     67.11
823   322.0                 0.0      0.0  203.0               0.0             974.0           800.0  180     29.59

62 rows × 9 columns
Out[32]:
     Cement  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  Coarse Aggregate  Fine Aggregate  Age  Strength
755   540.0                 0.0      0.0  173.0               0.0            1125.0           613.0  180     71.62
756   540.0                 0.0      0.0  173.0               0.0            1125.0           613.0  270     74.17
795   525.0                 0.0      0.0  189.0               0.0            1125.0           613.0  180     61.92
797   500.0                 0.0      0.0  200.0               0.0            1125.0           613.0  180     51.04
798   500.0                 0.0      0.0  200.0               0.0            1125.0           613.0  270     55.16
820   525.0                 0.0      0.0  189.0               0.0            1125.0           613.0  270     67.11
The second approach is to write filters without assigning them to variables. I don't
recommend doing this because they can look too cluttered. Also, the first approach is much
more reproducible.
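A hypothetical inline version of the same idea (not the lost cell that produced the output below):

concrete_df[(concrete_df['Water'] > 180) & (concrete_df['Fly Ash'] == 0)]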
Out[33]: the filtered concrete data frame (columns: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age, Strength)
Grouping and Aggregation
Grouping --- Forming groups from a column's values. For example, we have the 'origin' column in the automobile dataset. Every row tells us whether the origin of the car is 1 (USA), 2 (Europe) or 3 (Asia). There are 398 rows in the dataset, meaning that there are 398 values under the 'origin' column, each representing one of these 3 origins. Here, we can form groups based on the origin of the automobiles. Instead of considering them through individual row values, we can get an overview of all automobiles organized into these 3 groups.
Aggregation --- After grouping our data, we can look at the values of different columns based on the groups we have. For example, we can look at the values of the 'weight' column for each group. To make things more insightful, we can apply an aggregate function to those column values. In the example below, we look at the average weight of each origin group by using the mean aggregate function. Some common aggregate functions are:
Sum
Count
Min
Max
Mean
Median
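The cell that created the groupby object used below is not in the export; given how it is used, it was presumably:

grouped_origin = auto_df.groupby('origin')   # one group per origin value (1, 2, 3)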
In [35]: # The mean of the weight column for each origin group
grouped_origin['weight'].mean()
Out[35]:
origin
1    3361.931727
2    2423.300000
3    2221.227848
Name: weight, dtype: float64
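The input for the next output is missing; the values are consistent with the maximum mpg per origin group (the exact aggregate is an assumption):

grouped_origin['mpg'].max()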
Out[36]:
origin
1    39.0
2    44.3
3    46.6
Name: mpg, dtype: float64
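Likewise, the cell that created grouped_cylinder and the aggregation below are missing; the values look like the maximum acceleration for each cylinder group (the aggregate is an assumption):

grouped_cylinder = auto_df.groupby('cylinders')
grouped_cylinder['acceleration'].max()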
Out[37]:
cylinders
3    13.5
4    24.8
5    20.1
6    21.0
8    22.2
Name: acceleration, dtype: float64
In [38]: # Standard deviation of mpg (miles per gallon) for each cylinder-number group
grouped_cylinder['mpg'].std()
Out[38]:
cylinders
3    2.564501
4    5.710156
5    8.228204
6    3.807322
8    2.836284
Name: mpg, dtype: float64
In [39]: # Such groupby aggregation results can also be accessed like data frames by using double brackets
grouped_cylinder[['mpg']].std()
Out[39]: mpg
cylinders
3 2.564501
4 5.710156
5 8.228204
6 3.807322
8 2.836284
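The input for Out[40] is missing; the values match the mean mpg for each cylinder group (an assumption):

grouped_cylinder[['mpg']].mean()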
Out[40]: mpg
cylinders
3 20.550000
4 29.286765
5 27.366667
6 19.985714
8 14.963107
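The cells behind Out[42] are missing as well; the values look like the median Strength for each Age group of the concrete data (the variable name and the aggregate are assumptions):

grouped_age = concrete_df.groupby('Age')
grouped_age[['Strength']].median()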
Out[42]: Strength
Age
1 9.455
3 15.720
7 21.650
14 26.540
28 33.760
56 51.720
90 39.680
91 67.950
100 46.985
120 39.380
180 40.905
270 51.730
360 41.685
365 42.815
Adding New Columns
The main rule to keep in mind here is that the column we add has to have the same length (number of rows) as the rest of the data frame.
We can fill in the column values manually, or we can use methods and functions to fill the new column with processed or aggregated values. You will most likely go with the second approach, as it is more practical, faster and easier.
For our example, we will add a new column to the concrete dataset, which will show the
strength/cement ratio.
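A minimal sketch of how this can be done (the new column name is an assumption):

concrete_df['Ratio'] = concrete_df['Strength'] / concrete_df['Cement']   # element-wise ratio, same length as the frame
concrete_df.head()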
Out[43]: the concrete data frame with the new ratio column added after Strength
Exercises
You can find the diabetes data here:
https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset
Select the first 30 rows of the second, fourth and the fifth columns of the concrete
dataset
Form a data frame from the mpg, cylinders and the displacement columns of the auto-
mpg dataset
Sort the concrete data by strength in ASCENDING order, select the first 20 rows of
strength and cement columns
Sort the concrete data by age in DESCENDING order, select the first 15 rows of age and
strength columns
Sort the diabetes data by glucose in DESCENDING order, select the first 12 rows of
glucose and bmi columns
Sort the auto dataset by acceleration in ASCENDING order, select the first 15 rows of
acceleration, mpg, displacement and weight columns
Patients with a glucose higher than 120 AND blood pressure higher than or equal to 68
(diabetes data)
Patients with glucose higher than 140 AND bmi lower than 27 (diabetes data)
Concrete with water higher than 200 AND age lower than 300
Automobiles with mpg rate higher than 15 AND cylinder number higher than 6
Automobiles with mpg higher than 20 AND weight lower than 4000 AND acceleration
higher than or equal to 15
Access the minimum strength values for each age group of the concrete dataset.
Access the standard deviation of weight for each origin group of the auto dataset.