pandas.ipynb - Colab

The document is a Jupyter notebook that demonstrates the use of the pandas library for data manipulation in Python. It covers creating and manipulating Series and DataFrames, handling missing data, and performing operations like sorting and filtering. The notebook includes practical examples and outputs to illustrate the functionalities of pandas.

Uploaded by

Mohit Kumar

27/04/2024, 14:23 pandas.ipynb - Colab

import pandas
pandas.__version__

'2.0.3'

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])


data

0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64

data.values

array([0.25, 0.5 , 0.75, 1. ])

data.index

RangeIndex(start=0, stop=4, step=1)

data[1]

0.5

data[1:3]

1 0.50
2 0.75
dtype: float64

df = pd.DataFrame({
'name' : ['Bob', 'Jen', 'Tim' ],
'age' : [20,30,40],
'pet' : ['cat', 'dog', 'bird']
})
df

name age pet

0 Bob 20 cat

1 Jen 30 dog

2 Tim 40 bird

data = pd.Series(['a','b','c'], index = [1,3,5])


data

1 a
3 b
5 c
dtype: object

data[1]

'a'

data[1:3]

3 b
5 c
dtype: object

data.loc[1]

'a'

data.loc[1:3]

1 a
3 b
dtype: object

https://colab.research.google.com/drive/1qhtF1UDqi1b5pB3ZB8MJZo5eH3RerI7L#printMode=true 1/22

data

1 a
3 b
5 c
dtype: object

data.iloc[0] # value at position 0 (the first element), whatever its label

'a'

data.iloc[1:3]

3 b
5 c
dtype: object
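The cells above show that plain `data[1:3]` slices by position while `data.loc[1:3]` slices by label. A minimal sketch of the same contrast follows; the variable names `s`, `by_label` and `by_position` are illustrative and not part of the notebook.

```python
import pandas as pd

# Same Series as above: integer labels that do not match positions
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1:3]      # label slice: both endpoints are included
by_position = s.iloc[1:3]  # position slice: the stop position is excluded

print(list(by_label.index))     # labels selected by .loc
print(list(by_position.index))  # labels selected by .iloc
```

Spelling out `.loc` or `.iloc` avoids having to remember which rule a bare `[]` applies.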

df

name age pet

0 Bob 20 cat

1 Jen 30 dog

2 Tim 40 bird

df.sort_values('pet', inplace= True)

df

name age pet

2 Tim 40 bird

0 Bob 20 cat

1 Jen 30 dog

df.loc[0]

name Bob
age 20
pet cat
Name: 0, dtype: object

df.iloc[0]

name Tim
age 40
pet bird
Name: 2, dtype: object

df.iloc[:,2] # columns are positional: name is 0, age is 1, pet is 2, so 2 selects pet

2 bird
0 cat
1 dog
Name: pet, dtype: object

df.iloc[-1, :] # -1 = last row and : = all the columns

name Jen
age 30
pet dog
Name: 1, dtype: object

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data


area pop

California 423967 38332521

Texas 695662 26448193

New York 141297 19651127

Florida 170312 19552860

Illinois 149995 12882135

area

California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
dtype: int64

data['area']

California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64

data.area

California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64

data['density'] = data['pop']/data['area']

data

area pop density

California 423967 38332521 90.413926

Texas 695662 26448193 38.018740

New York 141297 19651127 139.076746

Florida 170312 19552860 114.806121

Illinois 149995 12882135 85.883763

data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

data.T

California Texas New York Florida Illinois

area 4.239670e+05 6.956620e+05 1.412970e+05 1.703120e+05 1.499950e+05

pop 3.833252e+07 2.644819e+07 1.965113e+07 1.955286e+07 1.288214e+07

density 9.041393e+01 3.801874e+01 1.390767e+02 1.148061e+02 8.588376e+01

data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

data.iloc[:3,:2]


area pop

California 423967 38332521

Texas 695662 26448193

New York 141297 19651127

data.loc[:,:'pop']

area pop

California 423967 38332521

Texas 695662 26448193

New York 141297 19651127

Florida 170312 19552860

Illinois 149995 12882135

data.loc[data.density>100,['pop','density']]

pop density

New York 19651127 139.076746

Florida 19552860 114.806121

data

area pop density

California 423967 38332521 90.413926

Texas 695662 26448193 38.018740

New York 141297 19651127 139.076746

Florida 170312 19552860 114.806121

Illinois 149995 12882135 85.883763

data.iloc[0,2]

90.41392608386974

data.iloc[0,2] = 90
data

area pop density

California 423967 38332521 90.000000

Texas 695662 26448193 38.018740

New York 141297 19651127 139.076746

Florida 170312 19552860 114.806121

Illinois 149995 12882135 85.883763

data['Florida':'Illinois']

area pop density

Florida 170312 19552860 114.806121

Illinois 149995 12882135 85.883763

data[1:3]

area pop density

Texas 695662 26448193 38.018740

New York 141297 19651127 139.076746

data[data['density']>100]


area pop density

New York 141297 19651127 139.076746

Florida 170312 19552860 114.806121

import numpy as np

data = pd.Series([1, np.nan, 'hello', None])


data

0 1
1 NaN
2 hello
3 None
dtype: object

data.isnull()

0 False
1 True
2 False
3 True
dtype: bool

data[data.notnull()]

0 1
2 hello
dtype: object

data.dropna()

0 1
2 hello
dtype: object

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df

0 1 2

0 1.0 NaN 2

1 2.0 3.0 5

2 NaN 4.0 6

df.dropna()

0 1 2

1 2.0 3.0 5

df.dropna(axis='columns')

0 2

1 5

2 6

df.dropna(axis=1)

0 2

1 5

2 6

df[3] = np.nan
df

0 1 2 3

0 1.0 NaN 2 NaN

1 2.0 3.0 5 NaN

2 NaN 4.0 6 NaN

df.dropna(axis='columns', how='all') # how='all' drops only the columns (or rows) whose values are all null

0 1 2

0 1.0 NaN 2

1 2.0 3.0 5

2 NaN 4.0 6

df.dropna(axis='rows', thresh=3) # thresh is the minimum number of non-null values a row (or column) needs to be kept

0 1 2 3

1 2.0 3.0 5 NaN
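To see why only row 1 survives `thresh=3`, it can help to count the non-null values per row. This is a small sketch that rebuilds the same frame under the hypothetical name `df_nan`.

```python
import numpy as np
import pandas as pd

# Same values as the df above, including the all-NaN column 3
df_nan = pd.DataFrame([[1, np.nan, 2, np.nan],
                       [2, 3, 5, np.nan],
                       [np.nan, 4, 6, np.nan]])

# Number of non-null entries in each row
non_null_per_row = df_nan.notnull().sum(axis=1)
print(non_null_per_row)

# Rows with at least 3 non-null values are kept
kept = df_nan.dropna(axis='rows', thresh=3)
print(kept)
```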

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))


data

a 1.0
b NaN
c 2.0
d NaN
e 3.0
dtype: float64

data.fillna(0)

a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
dtype: float64

data.fillna(method='ffill')

a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
dtype: float64

data.fillna(method='bfill')

a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
dtype: float64
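As far as I know, pandas releases after the 2.0.3 used here deprecate the `method=` argument of `fillna`. The dedicated `ffill()` and `bfill()` methods give the same forward and backward fills; the sketch below uses the illustrative names `s`, `forward` and `backward`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = s.ffill()   # propagate the last valid value forward
backward = s.bfill()  # propagate the next valid value backward

print(forward)
print(backward)
```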

df

0 1 2 3

0 1.0 NaN 2 NaN

1 2.0 3.0 5 NaN

2 NaN 4.0 6 NaN

df.fillna(method='ffill', axis=1)


0 1 2 3

0 1.0 1.0 2.0 2.0

1 2.0 3.0 5.0 5.0

2 NaN 4.0 6.0 6.0

df.fillna(0)

0 1 2 3

0 1.0 0.0 2 0.0

1 2.0 3.0 5 0.0

2 0.0 4.0 6 0.0

df

0 1 2 3

0 1.0 NaN 2 NaN

1 2.0 3.0 5 NaN

2 NaN 4.0 6 NaN

df[[1]].fillna(0)

0 0.0

1 3.0

2 4.0

data = pd.read_csv('titanic.csv')

data

passengerid survived pclass name sex age sibsp parch ticket fare cabin embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C

2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

... ... ... ... ... ... ... ... ... ... ... ... ...

886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S

887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S

888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

data.head()


passengerid survived pclass name sex age sibsp parch ticket fare cabin embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C

2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

data.tail(10)

passengerid survived pclass name sex age sibsp parch ticket fare cabin embarked

881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S

882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S

883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S

884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S

885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q

886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S

887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S

888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 passengerid 891 non-null int64
1 survived 891 non-null int64
2 pclass 891 non-null int64
3 name 891 non-null object
4 sex 891 non-null object
5 age 714 non-null float64
6 sibsp 891 non-null int64
7 parch 891 non-null int64
8 ticket 891 non-null object
9 fare 891 non-null float64
10 cabin 204 non-null object
11 embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

data.describe().T

count mean std min 25% 50% 75% max

passengerid 891.0 446.000000 257.353842 1.00 223.5000 446.0000 668.5 891.0000

survived 891.0 0.383838 0.486592 0.00 0.0000 0.0000 1.0 1.0000

pclass 891.0 2.308642 0.836071 1.00 2.0000 3.0000 3.0 3.0000

age 714.0 29.699118 14.526497 0.42 20.1250 28.0000 38.0 80.0000

sibsp 891.0 0.523008 1.102743 0.00 0.0000 0.0000 1.0 8.0000

parch 891.0 0.381594 0.806057 0.00 0.0000 0.0000 0.0 6.0000

fare 891.0 32.204208 49.693429 0.00 7.9104 14.4542 31.0 512.3292

data[data.sex=='male']


passengerid survived pclass name sex age sibsp parch ticket fare cabin embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S

7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S

... ... ... ... ... ... ... ... ... ... ... ... ...

883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S

884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S

886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

data.age[data.sex=='male']

0 22.0
4 35.0
5 NaN
6 54.0
7 2.0
...
883 28.0
884 25.0
886 27.0
889 26.0
890 32.0
Name: age, Length: 577, dtype: float64

How many men and women were on the Titanic?

data[data.sex=='male'].count()

passengerid 577
survived 577
pclass 577
name 577
sex 577
age 453
sibsp 577
parch 577
ticket 577
fare 577
cabin 107
embarked 577
dtype: int64

data.sex[data.sex=='male'].count()

577
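The cells above count only the men. One way to get both counts in a single call is `value_counts()`; the sketch below runs on a tiny made-up frame (`toy` is hypothetical, not the Titanic data) so it is runnable without `titanic.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the 'sex' column of the Titanic frame
toy = pd.DataFrame({'sex': ['male', 'female', 'male', 'male', 'female']})

# One row per distinct value, with its number of occurrences
counts = toy['sex'].value_counts()
print(counts)
```

On the real frame the equivalent call would be `data['sex'].value_counts()`.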

data[(data.sex=='male') & (data.age>=18)]


passengerid survived pclass name sex age sibsp parch ticket fare cabin embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S

12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S

13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S

... ... ... ... ... ... ... ... ... ... ... ... ...

883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S

884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S

886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

What was the survival rate for adult men (age>=18)?

data.survived[(data.sex=='male') & (data.age>=18)].mean()

0.17721518987341772

What was the survival rate for women and children?

data.survived[(data.sex=='female') | (data.age<18)].mean()

0.6881720430107527
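One thing to keep in mind with the two masks above: comparisons against a missing age evaluate to False, so a man whose age is NaN lands in neither group. A small sketch on made-up rows (`toy` and the mask names are hypothetical) shows the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second man and the woman have no recorded age
toy = pd.DataFrame({
    'sex': ['male', 'male', 'male', 'female'],
    'age': [30.0, np.nan, 10.0, np.nan],
    'survived': [0, 1, 1, 1],
})

adult_men = (toy.sex == 'male') & (toy.age >= 18)
women_children = (toy.sex == 'female') | (toy.age < 18)

# Rows matched by neither mask: men with a missing age
unclassified = ~(adult_men | women_children)
print(toy[unclassified])
```

The woman with a missing age is still counted because the `sex == 'female'` half of the `|` is True for her.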

Use groupby to compare the survival rates of men and women

data.groupby('sex')['survived'].mean()

sex
female 0.742038
male 0.188908
Name: survived, dtype: float64

new = data.groupby(['sex','pclass'])['survived'].mean()
new

sex pclass
female 1 0.968085
2 0.921053
3 0.500000
male 1 0.368852
2 0.157407
3 0.135447
Name: survived, dtype: float64
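A two-level groupby result like `new` is often easier to read as a table with one level moved to the columns. The sketch below does that with `unstack` on a small made-up frame; `toy`, `rates` and `table` are illustrative names and the numbers are not the Titanic figures.

```python
import pandas as pd

# Hypothetical passengers, just enough to fill every sex/pclass cell used
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'pclass': [1, 3, 1, 3, 3, 3],
    'survived': [1, 0, 0, 0, 1, 1],
})

rates = toy.groupby(['sex', 'pclass'])['survived'].mean()

# Move the 'pclass' level from the row index to the columns
table = rates.unstack('pclass')
print(table)
```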

import seaborn as sns


planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

planets.head(5)

method number orbital_period mass distance year

0 Radial Velocity 1 269.300 7.10 77.40 2006

1 Radial Velocity 1 874.774 2.21 56.95 2008

2 Radial Velocity 1 763.000 2.60 19.84 2011

3 Radial Velocity 1 326.030 19.40 110.62 2007

4 Radial Velocity 1 516.220 10.50 119.47 2009

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df

key data

0 A 0

1 B 1

2 C 2

3 A 3

4 B 4

5 C 5

df.groupby('key')['data'].sum()

key
A 3
B 5
C 7
Name: data, dtype: int64

planets.groupby('method').median()

number orbital_period mass distance year

method

Astrometry 1.0 631.180000 NaN 17.875 2011.5

Eclipse Timing Variations 2.0 4343.500000 5.125 315.360 2010.0

Imaging 1.0 27500.000000 NaN 40.395 2009.0

Microlensing 1.0 3300.000000 NaN 3840.000 2010.0

Orbital Brightness Modulation 2.0 0.342887 NaN 1180.000 2011.0

Pulsar Timing 3.0 66.541900 NaN 1200.000 1994.0

Pulsation Timing Variations 1.0 1170.000000 NaN NaN 2007.0

Radial Velocity 1.0 360.200000 1.260 40.445 2009.0

Transit 1.0 5.714932 1.470 341.000 2012.0

Transit Timing Variations 2.0 57.011000 NaN 855.000 2012.5

planets.groupby('method')['orbital_period'].median()

method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000
Name: orbital_period, dtype: float64

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df

key data1 data2

0 A 0 5

1 B 1 0

2 C 2 3

3 A 3 3

4 B 4 7

5 C 5 9

df.groupby('key')['data1'].median()

key
A 1.5
B 2.5
C 3.5
Name: data1, dtype: float64

df.groupby('key').aggregate(['min', np.median, max])

data1 data2

min median max min median max

key

A 0 1.5 3 3 4.0 5

B 1 2.5 4 0 3.5 7

C 2 3.5 5 3 6.0 9

df.groupby('key').aggregate({'data1': ['min','mean'],
'data2': 'max'})

data1 data2

min mean max

key

A 0 1.5 5

B 1 2.5 7

C 2 3.5 9
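The dictionary form above produces two-level column names. If flat, explicit column names are preferred, named aggregation is one alternative; this sketch rebuilds the same values under the hypothetical name `df_keys`, and the output column names are made up for illustration.

```python
import pandas as pd

# Same values as df above, with data2 written out instead of drawn from rng
df_keys = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword is an output column: (input column, aggregation)
summary = df_keys.groupby('key').agg(
    data1_min=('data1', 'min'),
    data1_mean=('data1', 'mean'),
    data2_max=('data2', 'max'),
)
print(summary)
```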

df

key data1 data2

0 A 0 5

1 B 1 0

2 C 2 3

3 A 3 3

4 B 4 7

5 C 5 9

df.groupby('key').std()

data1 data2

key

A 2.12132 1.414214

B 2.12132 4.949747

C 2.12132 4.242641


def filter_func(x):
    return x['data2'].std() > 4

display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

df

key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9

df.groupby('key').std()

data1 data2
key
A 2.12132 1.414214
B 2.12132 4.949747
C 2.12132 4.242641

df.groupby('key').filter(filter_func)

key data1 data2
1 B 1 0
2 C 2 3
4 B 4 7
5 C 5 9

df.groupby('key').transform(lambda x: x - x.mean())

# Group means
# data1: A: 1.5, B: 2.5, C: 3.5
# data2: A: 4, B: 3.5, C: 6

# row 0 (key A): 0 - 1.5 = -1.5, 5 - 4 = 1
# row 1 (key B): 1 - 2.5 = -1.5, 0 - 3.5 = -3.5

data1 data2

0 -1.5 1.0

1 -1.5 -3.5

2 -1.5 -3.0

3 1.5 -1.0

4 1.5 3.5

5 1.5 3.0

# df['col1'] = df.data1 - df.data1.mean() would add that as a new column


df.data1 - df.data1.mean()

0 -2.5
1 -1.5
2 -0.5
3 0.5
4 1.5
5 2.5
Name: data1, dtype: float64
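The last cell subtracts the global mean of `data1`, while the `transform` cell before it subtracts each group's own mean. A quick way to check the group-wise version is that every group mean becomes zero afterwards. The sketch below assumes the same values as `df`, rebuilt under the hypothetical name `df_keys`.

```python
import pandas as pd

df_keys = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Subtract each group's mean from its own rows; the result keeps df's shape
centered = df_keys.groupby('key')[['data1', 'data2']].transform(
    lambda x: x - x.mean())

# After centering, the mean within every group should be 0
group_means_after = centered.groupby(df_keys['key']).mean()
print(centered)
print(group_means_after)
```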

def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

display('df', "df.groupby('key').apply(norm_by_data2)")

#1: data1: 1/(0+7) = .1428


#2: data1: 2/(3+9) = .1666

df

key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9

df.groupby('key').apply(norm_by_data2)

key data1 data2
key
A 0 A 0.000000 5
  3 A 0.375000 3
B 1 B 0.142857 0
  4 B 0.571429 7
C 2 C 0.166667 3
  5 C 0.416667 9


L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')

#0: data1: 0+2+5 = 7


#0: data2: 5+3+9 = 17
#1: data1: 1+3 = 4
#1: data2: 0+3 =3
#2: data1: 4
#2: data2: 7

df

key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9

df.groupby(L).sum()

key data1 data2
0 ACC 7 17
1 BA 4 3
2 B 4 7

df2 = df.set_index('key')
df2

data1 data2

key

A 0 5

B 1 0

C 2 3

A 3 3

B 4 7

C 5 9

display('df2', 'df2.groupby(str.lower).mean()')

df2 df2.groupby(str.lower).mean()

data1 data2 data1 data2

key key

A 0 5 a 1.5 4.0

B 1 0 b 2.5 3.5

C 2 3 c 3.5 6.0

A 3 3

B 4 7

C 5 9

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')

df1 df2

employee group employee hire_date

0 Bob Accounting 0 Lisa 2004

1 Jake Engineering 1 Bob 2008

2 Lisa Engineering 2 Jake 2012

3 Sue HR 3 Sue 2014

df3 = pd.merge(df1, df2)
df3

employee group hire_date

0 Bob Accounting 2008

1 Jake Engineering 2012

2 Lisa Engineering 2004

3 Sue HR 2014

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

df3 df4 pd.merge(df3, df4)

employee group hire_date group supervisor employee group hire_date supervisor

0 Bob Accounting 2008 0 Accounting Carly 0 Bob Accounting 2008 Carly

1 Jake Engineering 2012 1 Engineering Guido 1 Jake Engineering 2012 Guido

2 Lisa Engineering 2004 2 HR Steve 2 Lisa Engineering 2004 Guido

3 Sue HR 2014 3 Sue HR 2014 Steve

df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

df1 df5 pd.merge(df1, df5)

employee group group skills employee group skills

0 Bob Accounting 0 Accounting math 0 Bob Accounting math

1 Jake Engineering 1 Accounting spreadsheets 1 Bob Accounting spreadsheets

2 Lisa Engineering 2 Engineering coding 2 Jake Engineering coding

3 Sue HR 3 Engineering linux 3 Jake Engineering linux

4 HR spreadsheets 4 Lisa Engineering coding

5 HR organization 5 Lisa Engineering linux

6 Sue HR spreadsheets

7 Sue HR organization

display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

df1 df2 pd.merge(df1, df2, on='employee')

employee group employee hire_date employee group hire_date

0 Bob Accounting 0 Lisa 2004 0 Bob Accounting 2008

1 Jake Engineering 1 Bob 2008 1 Jake Engineering 2012

2 Lisa Engineering 2 Jake 2012 2 Lisa Engineering 2004

3 Sue HR 3 Sue 2014 3 Sue HR 2014

df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')


df1 df3 pd.merge(df1, df3, left_on="employee", right_on="name")

employee group name salary employee group name salary

0 Bob Accounting 0 Bob 70000 0 Bob Accounting Bob 70000

1 Jake Engineering 1 Jake 80000 1 Jake Engineering Jake 80000

2 Lisa Engineering 2 Lisa 120000 2 Lisa Engineering Lisa 120000

3 Sue HR 3 Sue 90000 3 Sue HR Sue 90000

pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)

employee group salary

0 Bob Accounting 70000

1 Jake Engineering 80000

2 Lisa Engineering 120000

3 Sue HR 90000

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')

df1a df2a

group hire_date

employee employee

Bob Accounting Lisa 2004

Jake Engineering Bob 2008

Lisa Engineering Jake 2012

Sue HR Sue 2014

display('df1a', 'df2a',
"pd.merge(df1a, df2a, left_index=True, right_index=True)")

df1a df2a pd.merge(df1a, df2a, left_index=True, right_index=True)

group hire_date group hire_date

employee employee employee

Bob Accounting Lisa 2004 Bob Accounting 2008

Jake Engineering Bob 2008 Jake Engineering 2012

Lisa Engineering Jake 2012 Lisa Engineering 2004

Sue HR Sue 2014 Sue HR 2014

display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")

df1a

group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df3

name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000

pd.merge(df1a, df3, left_index=True, right_on='name')

group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')

df6 df7 pd.merge(df6, df7)

name food name drink name food drink

0 Peter fish 0 Mary wine 0 Mary bread wine

1 Paul beans 1 Joseph beer

2 Mary bread

pd.merge(df6, df7, how='inner')

name food drink

0 Mary bread wine

display('df6', 'df7', "pd.merge(df6, df7, how='left')")

df6 df7 pd.merge(df6, df7, how='left')

name food name drink name food drink

0 Peter fish 0 Mary wine 0 Peter fish NaN

1 Paul beans 1 Joseph beer 1 Paul beans NaN

2 Mary bread 2 Mary bread wine
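The inner and left joins above each drop some names. An outer join keeps every name from both sides, and `indicator=True` adds a `_merge` column recording where each row came from. This sketch uses copies of the two frames under the illustrative names `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                     'food': ['fish', 'beans', 'bread']})
right = pd.DataFrame({'name': ['Mary', 'Joseph'],
                      'drink': ['wine', 'beer']})

# Keep all names; _merge is 'left_only', 'right_only' or 'both'
outer = pd.merge(left, right, how='outer', indicator=True)
print(outer)
```

Cells with no match on one side are filled with NaN, as in the left join above.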

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

df8 df9 pd.merge(df8, df9, on="name")

name rank name rank name rank_x rank_y

0 Bob 1 0 Bob 3 0 Bob 1 3

1 Jake 2 1 Jake 1 1 Jake 2 1

2 Lisa 3 2 Lisa 4 2 Lisa 3 4

3 Sue 4 3 Sue 2 3 Sue 4 2

display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')

df8 df9 pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])

name rank name rank name rank_L rank_R

0 Bob 1 0 Bob 3 0 Bob 1 3

1 Jake 2 1 Jake 1 1 Jake 2 1

2 Lisa 3 2 Lisa 4 2 Lisa 3 4

3 Sue 4 3 Sue 2 3 Sue 4 2

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))


A B C

0 A0 B0 C0

1 A1 B1 C1

2 A2 B2 C2

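# Signature of pd.concat, for reference only (not a runnable cell).
# join_axes was removed in pandas 1.0, so it is omitted here.
# pd.concat(objs, axis=0, join='outer', ignore_index=False,
#           keys=None, levels=None, names=None, verify_integrity=False,
#           copy=True)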

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])


ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

1 A
2 B
3 C
4 D
5 E
6 F
dtype: object

df1 = make_df('AB', [1, 2])


df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

df1 df2 pd.concat([df1, df2])

A B A B A B

1 A1 B1 3 A3 B3 1 A1 B1

2 A2 B2 4 A4 B4 2 A2 B2

3 A3 B3

4 A4 B4

df3 = make_df('AB', [0, 1])


df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")

df3 df4 pd.concat([df3, df4], axis=1)

A B C D A B C D

0 A0 B0 0 C0 D0 0 A0 B0 C0 D0

1 A1 B1 1 C1 D1 1 A1 B1 C1 D1

x = make_df('AB', [0, 1])


y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')

x y pd.concat([x, y])

A B A B A B

0 A0 B0 0 A2 B2 0 A0 B0

1 A1 B1 1 A3 B3 1 A1 B1

0 A2 B2

1 A3 B3

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Index([0, 1], dtype='int64')

display('x', 'y', 'pd.concat([x, y], ignore_index=True)')


x y pd.concat([x, y], ignore_index=True)

A B A B A B

0 A0 B0 0 A2 B2 0 A0 B0

1 A1 B1 1 A3 B3 1 A1 B1

2 A2 B2

3 A3 B3

display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

x y pd.concat([x, y], keys=['x', 'y'])

A B A B A B

0 A0 B0 0 A2 B2 x 0 A0 B0

1 A1 B1 1 A3 B3 1 A1 B1

y 0 A2 B2

1 A3 B3

df5 = make_df('ABC', [1, 2])


df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')

df5 df6 pd.concat([df5, df6])

A B C B C D A B C D

1 A1 B1 C1 3 B3 C3 D3 1 A1 B1 C1 NaN

2 A2 B2 C2 4 B4 C4 D4 2 A2 B2 C2 NaN

3 NaN B3 C3 D3

4 NaN B4 C4 D4

pd.merge(df5, df6, on="B")

A B C_x C_y D

display('df5', 'df6',
"pd.concat([df5, df6], join='inner')")

df5 df6 pd.concat([df5, df6], join='inner')

A B C B C D B C

1 A1 B1 C1 3 B3 C3 D3 1 B1 C1

2 A2 B2 C2 4 B4 C4 D4 2 B2 C2

3 B3 C3

4 B4 C4

df1.append(df2)

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-36-8ab0723181fb> in <cell line: 1>()
----> 1 df1.append(df2)

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __getattr__(self, name)


5987 ):
5988 return self[name]
-> 5989 return object.__getattribute__(self, name)
5990
5991 def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'append'
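The error above is expected on the pandas 2.0.3 used in this notebook, because `DataFrame.append` was removed in pandas 2.0. `pd.concat` covers the same use case. The sketch below uses two small frames under the hypothetical names `df_a` and `df_b`, with the same values as `df1` and `df2` at this point.

```python
import pandas as pd

df_a = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df_b = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# Row-wise concatenation, the replacement for df_a.append(df_b)
combined = pd.concat([df_a, df_b])
print(combined)
```

When adding many pieces, collecting them in a list and calling `pd.concat` once is usually cheaper than concatenating inside a loop.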

df = pd.read_csv('/content/ETH_1h.csv')

df.head()

Date Symbol Open High Low Close Volume

0 2020-03-13 08-PM ETHUSD 129.94 131.82 126.87 128.71 1940673.93

1 2020-03-13 07-PM ETHUSD 119.51 132.02 117.10 129.94 7579741.09

2 2020-03-13 06-PM ETHUSD 124.47 124.85 115.50 119.51 4898735.81

3 2020-03-13 05-PM ETHUSD 124.08 127.42 121.63 124.47 2753450.92

4 2020-03-13 04-PM ETHUSD 124.85 129.51 120.17 124.08 4461424.71

df.loc[0,'Date']

'2020-03-13 08-PM'

df['Date'] = pd.to_datetime(df['Date'], format = '%Y-%m-%d %I-%p')

df['Date']

0 2020-03-13 20:00:00
1 2020-03-13 19:00:00
2 2020-03-13 18:00:00
3 2020-03-13 17:00:00
4 2020-03-13 16:00:00
...
23669 2017-07-01 15:00:00
23670 2017-07-01 14:00:00
23671 2017-07-01 13:00:00
23672 2017-07-01 12:00:00
23673 2017-07-01 11:00:00
Name: Date, Length: 23674, dtype: datetime64[ns]
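In the format string used above, `%I` is the hour on a 12-hour clock and `%p` is the AM/PM marker, which is how `08-PM` became `20:00:00`. The sketch below parses a few made-up strings (the 12-AM and 12-PM values are not taken from the CSV) to show the edge cases.

```python
import pandas as pd

# Hypothetical samples in the same layout as the Date column
raw = pd.Series(['2020-03-13 08-PM', '2020-03-13 12-AM', '2020-03-13 12-PM'])

# %I = 12-hour clock hour, %p = AM/PM marker
parsed = pd.to_datetime(raw, format='%Y-%m-%d %I-%p')
print(parsed)
print(parsed.dt.hour.tolist())
```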

df.loc[0,'Date'].day_name()

'Friday'

df['Date'].dt.day_name()

0 Friday
1 Friday
2 Friday
3 Friday
4 Friday
...
23669 Saturday
23670 Saturday
23671 Saturday
23672 Saturday
23673 Saturday
Name: Date, Length: 23674, dtype: object

df['DayOfWeek'] = df['Date'].dt.day_name()

df


Date Symbol Open High Low Close Volume DayOfWeek

0 2020-03-13 20:00:00 ETHUSD 129.94 131.82 126.87 128.71 1940673.93 Friday

1 2020-03-13 19:00:00 ETHUSD 119.51 132.02 117.10 129.94 7579741.09 Friday

2 2020-03-13 18:00:00 ETHUSD 124.47 124.85 115.50 119.51 4898735.81 Friday

3 2020-03-13 17:00:00 ETHUSD 124.08 127.42 121.63 124.47 2753450.92 Friday

4 2020-03-13 16:00:00 ETHUSD 124.85 129.51 120.17 124.08 4461424.71 Friday

... ... ... ... ... ... ... ... ...

23669 2017-07-01 15:00:00 ETHUSD 265.74 272.74 265.00 272.57 1500282.55 Saturday

23670 2017-07-01 14:00:00 ETHUSD 268.79 269.90 265.00 265.74 1702536.85 Saturday

23671 2017-07-01 13:00:00 ETHUSD 274.83 274.93 265.00 268.79 3010787.99 Saturday

23672 2017-07-01 12:00:00 ETHUSD 275.01 275.01 271.00 274.83 824362.87 Saturday

23673 2017-07-01 11:00:00 ETHUSD 279.98 279.99 272.10 275.01 679358.87 Saturday

23674 rows × 8 columns

df['Date'].min()

Timestamp('2017-07-01 11:00:00')

df['Date'].max()

Timestamp('2020-03-13 20:00:00')

df['Date'].max() - df['Date'].min() # time delta

Timedelta('986 days 09:00:00')

df.loc[df['Date'] >= '2020']

Date Symbol Open High Low Close Volume DayOfWeek

0 2020-03-13 20:00:00 ETHUSD 129.94 131.82 126.87 128.71 1940673.93 Friday

1 2020-03-13 19:00:00 ETHUSD 119.51 132.02 117.10 129.94 7579741.09 Friday

2 2020-03-13 18:00:00 ETHUSD 124.47 124.85 115.50 119.51 4898735.81 Friday

3 2020-03-13 17:00:00 ETHUSD 124.08 127.42 121.63 124.47 2753450.92 Friday

4 2020-03-13 16:00:00 ETHUSD 124.85 129.51 120.17 124.08 4461424.71 Friday

... ... ... ... ... ... ... ... ...

1744 2020-01-01 04:00:00 ETHUSD 129.57 130.00 129.50 129.56 702786.82 Wednesday

1745 2020-01-01 03:00:00 ETHUSD 130.37 130.44 129.38 129.57 496704.23 Wednesday

1746 2020-01-01 02:00:00 ETHUSD 130.14 130.50 129.91 130.37 396315.72 Wednesday

1747 2020-01-01 01:00:00 ETHUSD 128.34 130.14 128.32 130.14 635419.40 Wednesday

1748 2020-01-01 00:00:00 ETHUSD 128.54 128.54 128.12 128.34 245119.91 Wednesday

1749 rows × 8 columns

df.loc[(df['Date'] >= '2019') & (df['Date'] < '2020')]


Date Symbol Open High Low Close Volume DayOfWeek

1749 2019-12-31 23:00:00 ETHUSD 128.33 128.69 128.14 128.54 440678.91 Tuesday

1750 2019-12-31 22:00:00 ETHUSD 128.38 128.69 127.95 128.33 554646.02 Tuesday

1751 2019-12-31 21:00:00 ETHUSD 127.86 128.43 127.72 128.38 350155.69 Tuesday

1752 2019-12-31 20:00:00 ETHUSD 127.84 128.34 127.71 127.86 428183.38 Tuesday

1753 2019-12-31 19:00:00 ETHUSD 128.69 128.69 127.60 127.84 1169847.84 Tuesday

... ... ... ... ... ... ... ... ...

10504 2019-01-01 04:00:00 ETHUSD 130.75 133.96 130.74 131.96 2791135.37 Tuesday

10505 2019-01-01 03:00:00 ETHUSD 130.06 130.79 130.06 130.75 503732.63 Tuesday

10506 2019-01-01 02:00:00 ETHUSD 130.79 130.88 129.55 130.06 838183.43 Tuesday

10507 2019-01-01 01:00:00 ETHUSD 131.62 131.62 130.77 130.79 434917.99 Tuesday

10508 2019-01-01 00:00:00 ETHUSD 130.53 131.91 130.48 131.62 1067136.21 Tuesday

8760 rows × 8 columns

df.loc[(df['Date'] >= pd.to_datetime('2019-01-01')) & (df['Date'] < pd.to_datetime('2020-01-01'))]

Date Symbol Open High Low Close Volume DayOfWeek

1749 2019-12-31 23:00:00 ETHUSD 128.33 128.69 128.14 128.54 440678.91 Tuesday

1750 2019-12-31 22:00:00 ETHUSD 128.38 128.69 127.95 128.33 554646.02 Tuesday

1751 2019-12-31 21:00:00 ETHUSD 127.86 128.43 127.72 128.38 350155.69 Tuesday

1752 2019-12-31 20:00:00 ETHUSD 127.84 128.34 127.71 127.86 428183.38 Tuesday

1753 2019-12-31 19:00:00 ETHUSD 128.69 128.69 127.60 127.84 1169847.84 Tuesday

... ... ... ... ... ... ... ... ...

10504 2019-01-01 04:00:00 ETHUSD 130.75 133.96 130.74 131.96 2791135.37 Tuesday

10505 2019-01-01 03:00:00 ETHUSD 130.06 130.79 130.06 130.75 503732.63 Tuesday

10506 2019-01-01 02:00:00 ETHUSD 130.79 130.88 129.55 130.06 838183.43 Tuesday

10507 2019-01-01 01:00:00 ETHUSD 131.62 131.62 130.77 130.79 434917.99 Tuesday

10508 2019-01-01 00:00:00 ETHUSD 130.53 131.91 130.48 131.62 1067136.21 Tuesday

8760 rows × 8 columns

df.set_index('Date', inplace=True)

df

Symbol Open High Low Close Volume DayOfWeek

Date

2020-03-13 20:00:00 ETHUSD 129.94 131.82 126.87 128.71 1940673.93 Friday

2020-03-13 19:00:00 ETHUSD 119.51 132.02 117.10 129.94 7579741.09 Friday

2020-03-13 18:00:00 ETHUSD 124.47 124.85 115.50 119.51 4898735.81 Friday

2020-03-13 17:00:00 ETHUSD 124.08 127.42 121.63 124.47 2753450.92 Friday

2020-03-13 16:00:00 ETHUSD 124.85 129.51 120.17 124.08 4461424.71 Friday

... ... ... ... ... ... ... ...

2017-07-01 15:00:00 ETHUSD 265.74 272.74 265.00 272.57 1500282.55 Saturday

2017-07-01 14:00:00 ETHUSD 268.79 269.90 265.00 265.74 1702536.85 Saturday

2017-07-01 13:00:00 ETHUSD 274.83 274.93 265.00 268.79 3010787.99 Saturday

2017-07-01 12:00:00 ETHUSD 275.01 275.01 271.00 274.83 824362.87 Saturday

2017-07-01 11:00:00 ETHUSD 279.98 279.99 272.10 275.01 679358.87 Saturday

23674 rows × 7 columns

df.loc['2019']

Symbol Open High Low Close Volume DayOfWeek

Date

2019-12-31 23:00:00 ETHUSD 128.33 128.69 128.14 128.54 440678.91 Tuesday

2019-12-31 22:00:00 ETHUSD 128.38 128.69 127.95 128.33 554646.02 Tuesday

2019-12-31 21:00:00 ETHUSD 127.86 128.43 127.72 128.38 350155.69 Tuesday

2019-12-31 20:00:00 ETHUSD 127.84 128.34 127.71 127.86 428183.38 Tuesday

2019-12-31 19:00:00 ETHUSD 128.69 128.69 127.60 127.84 1169847.84 Tuesday
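`df.loc['2019']` works because the index is now a DatetimeIndex, which accepts partial date strings. The sketch below shows the same idea on a tiny made-up Series (`ts` and its values are hypothetical); the index is kept in ascending order here, which is an assumption that keeps the range slice straightforward.

```python
import pandas as pd

# Hypothetical time series spanning the 2018/2019/2020 boundaries
ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(['2018-12-31 23:00', '2019-01-01 00:00',
                          '2019-06-15 12:00', '2020-01-01 00:00']),
)

in_2019 = ts.loc['2019']                     # every timestamp in 2019
first_quarter = ts.loc['2019-01':'2019-03']  # January through March 2019

print(in_2019)
print(first_quarter)
```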
