pandas

January 23, 2025

1 What is Pandas?
Pandas is one of the most important Python libraries. It provides data structures for data analysis, the most widely used of which are Series and DataFrame. A Series is one-dimensional: it consists of a single column. A DataFrame is two-dimensional: it consists of rows and columns.
To install Pandas, you can use “pip install pandas”.
[1]: import pandas as pd # Let's import pandas as pd

[2]: pd.__version__ # To print the installed version of pandas

[2]: '1.1.3'
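As a quick illustration of the two structures, here is a minimal sketch (not one of the original cells; the sample values are made up):

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])          # one-dimensional: a single labeled column
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})   # two-dimensional: rows and columns
print(s.shape)    # (3,)
print(df.shape)   # (3, 2)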

from file: 02-Series Data Structure

2 Series Data Structure


[1]: import pandas as pd # Let's import Pandas as pd.

[2]: obj=pd.Series([1,"John",3.5,"Hey"])
obj

[2]: 0 1
1 John
2 3.5
3 Hey
dtype: object

[3]: obj[0]

[3]: 1

[4]: obj.values

[4]: array([1, 'John', 3.5, 'Hey'], dtype=object)

[5]: obj2=pd.Series([1,"John",3.5,"Hey"],index=["a","b","c","d"])
obj2

[5]: a 1
b John
c 3.5
d Hey
dtype: object

[6]: obj2["b"]

[6]: 'John'

[7]: obj2.index

[7]: Index(['a', 'b', 'c', 'd'], dtype='object')

[8]: score={"Jane":90, "Bill":80,"Elon":85,"Tom":75,"Tim":95}


names=pd.Series(score) # Convert to Series
names

[8]: Jane 90
Bill 80
Elon 85
Tom 75
Tim 95
dtype: int64

[9]: names["Tim"]

[9]: 95

[10]: names[names>=85]

[10]: Jane 90
Elon 85
Tim 95
dtype: int64

[11]: names["Tom"]=60
names

[11]: Jane 90
Bill 80
Elon 85

Tom 60
Tim 95
dtype: int64

[12]: names[names<=80]=83
names

[12]: Jane 90
Bill 83
Elon 85
Tom 83
Tim 95
dtype: int64

[13]: "Tom" in names

[13]: True

[14]: "Can" in names

[14]: False

[15]: names/10

[15]: Jane 9.0


Bill 8.3
Elon 8.5
Tom 8.3
Tim 9.5
dtype: float64

[16]: names**2

[16]: Jane 8100


Bill 6889
Elon 7225
Tom 6889
Tim 9025
dtype: int64

[17]: names.isnull()

[17]: Jane False


Bill False
Elon False
Tom False
Tim False

dtype: bool

from file: 03-Working with Series

3 Working with Series Data Structure


[1]: import pandas as pd

[2]: games=pd.read_csv("Data/vgsalesGlobale.csv")

[3]: games.head()

[3]: Rank Name Platform Year Genre Publisher \


0 1 Wii Sports Wii 2006.0 Sports Nintendo
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo

NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales


0 41.49 29.02 3.77 8.46 82.74
1 29.08 3.58 6.81 0.77 40.24
2 15.85 12.88 3.79 3.31 35.82
3 15.75 11.01 3.28 2.96 33.00
4 11.27 8.89 10.22 1.00 31.37

[4]: games.dtypes

[4]: Rank int64


Name object
Platform object
Year float64
Genre object
Publisher object
NA_Sales float64
EU_Sales float64
JP_Sales float64
Other_Sales float64
Global_Sales float64
dtype: object

[5]: games.Genre.describe()

[5]: count 16598


unique 12
top Action

freq 3316
Name: Genre, dtype: object

[6]: games.Genre.value_counts()

[6]: Action 3316


Sports 2346
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
Name: Genre, dtype: int64

[7]: games.Genre.value_counts(normalize=True)

[7]: Action 0.199783


Sports 0.141342
Misc 0.104772
Role-Playing 0.089649
Shooter 0.078925
Adventure 0.077479
Racing 0.075250
Platform 0.053380
Simulation 0.052235
Fighting 0.051090
Strategy 0.041029
Puzzle 0.035064
Name: Genre, dtype: float64

[8]: type(games.Genre.value_counts())

[8]: pandas.core.series.Series

[9]: games.Genre.value_counts().head()

[9]: Action 3316


Sports 2346
Misc 1739
Role-Playing 1488
Shooter 1310
Name: Genre, dtype: int64

[10]: games.Genre.unique()

[10]: array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',


'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
'Strategy'], dtype=object)

[11]: games.Genre.nunique()

[11]: 12

[12]: pd.crosstab(games.Genre, games.Year)

[12]: Year 1980.0 1981.0 1982.0 1983.0 1984.0 1985.0 1986.0 1987.0 \
Genre
Action 1 25 18 7 1 2 6 2
Adventure 0 0 0 1 0 0 0 1
Fighting 1 0 0 0 0 1 0 2
Misc 4 0 1 1 1 0 0 0
Platform 0 3 5 5 1 4 6 2
Puzzle 0 2 3 1 3 4 0 0
Racing 0 1 2 0 3 0 1 0
Role-Playing 0 0 0 0 0 0 1 3
Shooter 2 10 5 1 3 1 4 2
Simulation 0 1 0 0 0 1 0 0
Sports 1 4 2 1 2 1 3 4
Strategy 0 0 0 0 0 0 0 0

Year 1988.0 1989.0 … 2009.0 2010.0 2011.0 2012.0 2013.0 \


Genre …
Action 2 2 … 272 226 239 266 148
Adventure 0 0 … 141 154 108 58 60
Fighting 0 0 … 53 40 50 29 20
Misc 0 1 … 207 201 184 38 42
Platform 4 3 … 29 31 37 12 37
Puzzle 1 5 … 79 45 43 11 3
Racing 1 0 … 84 57 65 30 16
Role-Playing 3 2 … 103 103 95 78 71
Shooter 1 1 … 91 81 94 48 59
Simulation 1 0 … 123 82 56 18 18
Sports 2 3 … 184 186 122 54 53
Strategy 0 0 … 65 53 46 15 19

Year 2014.0 2015.0 2016.0 2017.0 2020.0


Genre
Action 186 255 119 1 0
Adventure 75 54 34 0 0
Fighting 23 21 14 0 0

Misc 41 39 18 0 0
Platform 10 14 10 0 0
Puzzle 8 6 0 0 0
Racing 27 19 20 0 0
Role-Playing 91 78 40 2 0
Shooter 47 34 32 0 0
Simulation 11 15 9 0 1
Sports 55 62 38 0 0
Strategy 8 17 10 0 0

[12 rows x 39 columns]
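A useful variant (a sketch, not in the original cells): passing margins=True adds an “All” row and column holding the totals.

pd.crosstab(games.Genre, games.Year, margins=True)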

[13]: games.Global_Sales.describe()

[13]: count 16598.000000


mean 0.537441
std 1.555028
min 0.010000
25% 0.060000
50% 0.170000
75% 0.470000
max 82.740000
Name: Global_Sales, dtype: float64

[14]: games.Global_Sales.mean()

[14]: 0.5374406555006628

[15]: games.Global_Sales.value_counts()

[15]: 0.02 1071


0.03 811
0.04 645
0.05 632
0.01 618
…
9.09 1
12.27 1
16.38 1
20.22 1
22.00 1
Name: Global_Sales, Length: 623, dtype: int64

[16]: games.Year.plot(kind="hist")

[16]: <AxesSubplot:ylabel='Frequency'>

[17]: games.Genre.value_counts()

[17]: Action 3316


Sports 2346
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
Name: Genre, dtype: int64

[18]: games.Genre.value_counts().plot(kind="bar")

[18]: <AxesSubplot:>

from file: 04-DataFrame

4 DataFrame
4.1 What is DataFrame?
[1]: import pandas as pd

[2]: data={"name":["Bill","Tom","Tim","John","Alex","Vanessa","Kate"],
"score":[90,80,85,75,95,60,65],
"sport":["Wrestling","Football","Skiing","Swimming","Tennis",
"Karete","Surfing"],
"sex":["M","M","M","M","F","F","F"]}

[3]: df=pd.DataFrame(data)

[4]: df

[4]: name score sport sex


0 Bill 90 Wrestling M
1 Tom 80 Football M
2 Tim 85 Skiing M

3 John 75 Swimming M
4 Alex 95 Tennis F
5 Vanessa 60 Karete F
6 Kate 65 Surfing F

[5]: df=pd.DataFrame(data,columns=["name","sport","sex","score"])
df

[5]: name sport sex score


0 Bill Wrestling M 90
1 Tom Football M 80
2 Tim Skiing M 85
3 John Swimming M 75
4 Alex Tennis F 95
5 Vanessa Karete F 60
6 Kate Surfing F 65

[6]: df.head()

[6]: name sport sex score


0 Bill Wrestling M 90
1 Tom Football M 80
2 Tim Skiing M 85
3 John Swimming M 75
4 Alex Tennis F 95

[7]: df.tail()

[7]: name sport sex score


2 Tim Skiing M 85
3 John Swimming M 75
4 Alex Tennis F 95
5 Vanessa Karete F 60
6 Kate Surfing F 65

[8]: df.tail(3)

[8]: name sport sex score


4 Alex Tennis F 95
5 Vanessa Karete F 60
6 Kate Surfing F 65

[9]: df.head(2)

[9]: name sport sex score


0 Bill Wrestling M 90
1 Tom Football M 80

[10]: df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"])
df

[10]: name sport gender score age


0 Bill Wrestling NaN 90 NaN
1 Tom Football NaN 80 NaN
2 Tim Skiing NaN 85 NaN
3 John Swimming NaN 75 NaN
4 Alex Tennis NaN 95 NaN
5 Vanessa Karete NaN 60 NaN
6 Kate Surfing NaN 65 NaN

[11]: df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"],


index=["one","two","three","four","five","six","seven"])
df

[11]: name sport gender score age


one Bill Wrestling NaN 90 NaN
two Tom Football NaN 80 NaN
three Tim Skiing NaN 85 NaN
four John Swimming NaN 75 NaN
five Alex Tennis NaN 95 NaN
six Vanessa Karete NaN 60 NaN
seven Kate Surfing NaN 65 NaN

[12]: df["sport"]

[12]: one Wrestling


two Football
three Skiing
four Swimming
five Tennis
six Karete
seven Surfing
Name: sport, dtype: object

[13]: my_columns=["name","sport"]
df[my_columns]

[13]: name sport


one Bill Wrestling
two Tom Football
three Tim Skiing
four John Swimming
five Alex Tennis
six Vanessa Karete
seven Kate Surfing

[14]: df.sport

[14]: one Wrestling


two Football
three Skiing
four Swimming
five Tennis
six Karete
seven Surfing
Name: sport, dtype: object

[15]: df.loc[["one"]]

[15]: name sport gender score age


one Bill Wrestling NaN 90 NaN

[16]: df.loc[["one","two"]]

[16]: name sport gender score age


one Bill Wrestling NaN 90 NaN
two Tom Football NaN 80 NaN

[17]: df["age"]=18

[18]: df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"],


index=["one","two","three","four","five","six","seven"])
values=[18,19,20,18,17,17,18]
df["age"]=values
df

[18]: name sport gender score age


one Bill Wrestling NaN 90 18
two Tom Football NaN 80 19
three Tim Skiing NaN 85 20
four John Swimming NaN 75 18
five Alex Tennis NaN 95 17
six Vanessa Karete NaN 60 17
seven Kate Surfing NaN 65 18

[19]: df["pass"]=df.score>=70
df

[19]: name sport gender score age pass


one Bill Wrestling NaN 90 18 True
two Tom Football NaN 80 19 True
three Tim Skiing NaN 85 20 True
four John Swimming NaN 75 18 True

five Alex Tennis NaN 95 17 True
six Vanessa Karete NaN 60 17 False
seven Kate Surfing NaN 65 18 False

[20]: del df["pass"]


df

[20]: name sport gender score age


one Bill Wrestling NaN 90 18
two Tom Football NaN 80 19
three Tim Skiing NaN 85 20
four John Swimming NaN 75 18
five Alex Tennis NaN 95 17
six Vanessa Karete NaN 60 17
seven Kate Surfing NaN 65 18

[21]: scores={"Math":{"A":85,"B":90,"C":95}, "Physics":{"A":90,"B":80,"C":75}}

[22]: scores_df=pd.DataFrame(scores)
scores_df

[22]: Math Physics


A 85 90
B 90 80
C 95 75

[23]: scores_df.T

[23]: A B C
Math 85 90 95
Physics 90 80 75

[24]: scores_df.index.name="name"
scores_df.columns.name="lesson"

[25]: scores_df

[25]: lesson Math Physics


name
A 85 90
B 90 80
C 95 75

[26]: scores_df.values

[26]: array([[85, 90],


[90, 80],

[95, 75]], dtype=int64)

[27]: scores_index=scores_df.index

[28]: scores_index[1]="Jack"
scores_index

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-e9a61a166ed1> in <module>
----> 1 scores_index[1]="Jack"
2 scores_index

~\anaconda3\envs\tensorflow\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)

4079
4080 def __setitem__(self, key, value):
-> 4081 raise TypeError("Index does not support mutable operations")
4082
4083 def __getitem__(self, key):

TypeError: Index does not support mutable operations

from file: 05-Indexing-Selecting-Filtering

5 Indexing & Selection & Filtering in Pandas


[1]: import pandas as pd
import numpy as np

[2]: obj=pd.Series(np.arange(5),
index=["a","b","c","d","e"])

[3]: obj

[3]: a 0
b 1
c 2
d 3
e 4
dtype: int32

[4]: obj["c"]

[4]: 2

[5]: obj[2]

[5]: 2

[6]: obj[0:3]

[6]: a 0
b 1
c 2
dtype: int32

[7]: obj[["a","c"]]

[7]: a 0
c 2
dtype: int32

[8]: obj[[0,2]]

[8]: a 0
c 2
dtype: int32

[9]: obj[obj<2]

[9]: a 0
b 1
dtype: int32

[10]: obj["a":"c"]

[10]: a 0
b 1
c 2
dtype: int32

[11]: obj["b":"c"]=5
obj

[11]: a 0
b 5
c 5
d 3
e 4
dtype: int32

5.1 DataFrame Indexing
[12]: data=pd.DataFrame(
np.arange(16).reshape(4,4),
index=["London","Paris",
"Berlin","Istanbul"],
columns=["one","two","three","four"])
data

[12]: one two three four


London 0 1 2 3
Paris 4 5 6 7
Berlin 8 9 10 11
Istanbul 12 13 14 15

[13]: data["two"]

[13]: London 1
Paris 5
Berlin 9
Istanbul 13
Name: two, dtype: int32

[14]: data[["one","two"]]

[14]: one two


London 0 1
Paris 4 5
Berlin 8 9
Istanbul 12 13

[15]: data[:3]

[15]: one two three four


London 0 1 2 3
Paris 4 5 6 7
Berlin 8 9 10 11

[16]: data[data["four"]>5]

[16]: one two three four


Paris 4 5 6 7
Berlin 8 9 10 11
Istanbul 12 13 14 15

[17]: data[data<5]=0
data

[17]: one two three four
London 0 0 0 0
Paris 0 5 6 7
Berlin 8 9 10 11
Istanbul 12 13 14 15

5.2 Selecting with iloc and loc


[18]: data.iloc[1]

[18]: one 0
two 5
three 6
four 7
Name: Paris, dtype: int32

[19]: data.iloc[1,[1,2,3]]

[19]: two 5
three 6
four 7
Name: Paris, dtype: int32

[20]: data.iloc[[1,3],[1,2,3]]

[20]: two three four


Paris 5 6 7
Istanbul 13 14 15

[21]: data.loc["Paris",["one","two"]]

[21]: one 0
two 5
Name: Paris, dtype: int32

[22]: data.loc[:"Paris","four"]

[22]: London 0
Paris 7
Name: four, dtype: int32

[23]: toy_data=pd.Series(np.arange(5),
index=["a","b","c",
"d","e"])
toy_data

[23]: a 0
b 1
c 2
d 3
e 4
dtype: int32

[24]: toy_data[-1]

[24]: 4
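A side note: on newer pandas releases, plain integer indexing of a labeled Series is deprecated because the key is ambiguous (label or position?). A sketch of the unambiguous lookups:

toy_data.iloc[-1]   # 4, last element by position
toy_data.loc["e"]   # 4, the same element by label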

from file: 06-Important Methods

6 Some Useful Methods in Pandas


[1]: import pandas as pd
import numpy as np

[2]: s=pd.Series([1,2,3,4],
index=["a","b","c","d"])
s

[2]: a 1
b 2
c 3
d 4
dtype: int64

[3]: s["a"]

[3]: 1

[4]: s2=s.reindex(["b","d","a","c","e"])
s2

[4]: b 2.0
d 4.0
a 1.0
c 3.0
e NaN
dtype: float64

[5]: s3=pd.Series(["blue","yellow","purple"],
index=[0,2,4])
s3

[5]: 0 blue
2 yellow
4 purple
dtype: object

[6]: s3.reindex(range(6),method="ffill")

[6]: 0 blue
1 blue
2 yellow
3 yellow
4 purple
5 purple
dtype: object

[7]: df=pd.DataFrame(np.arange(9).reshape(3,3),
index=["a","c","d"],
columns=["Tim","Tom","Kate"])
df

[7]: Tim Tom Kate


a 0 1 2
c 3 4 5
d 6 7 8

[8]: df2=df.reindex(["d","c","b","a"])
df2

[8]: Tim Tom Kate


d 6.0 7.0 8.0
c 3.0 4.0 5.0
b NaN NaN NaN
a 0.0 1.0 2.0

[9]: names=["Kate","Tim","Tom"]
df.reindex(columns=names)

[9]: Kate Tim Tom


a 2 0 1
c 5 3 4
d 8 6 7

[10]: df.loc[["c","d","a"]]

[10]: Tim Tom Kate


c 3 4 5
d 6 7 8

a 0 1 2

[11]: s=pd.Series(np.arange(5.),
index=["a","b","c","d","e"])
s

[11]: a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64

[12]: new_s=s.drop("b")
new_s

[12]: a 0.0
c 2.0
d 3.0
e 4.0
dtype: float64

[13]: s.drop(["c","d"])

[13]: a 0.0
b 1.0
e 4.0
dtype: float64

[14]: data=pd.DataFrame(np.arange(16).reshape(4,4),
index=["Kate","Tim",
"Tom","Alex"],
columns=list("ABCD"))
data

[14]: A B C D
Kate 0 1 2 3
Tim 4 5 6 7
Tom 8 9 10 11
Alex 12 13 14 15

[15]: data.drop(["Kate","Tim"])

[15]: A B C D
Tom 8 9 10 11
Alex 12 13 14 15

[16]: data.drop("A",axis=1)

[16]: B C D
Kate 1 2 3
Tim 5 6 7
Tom 9 10 11
Alex 13 14 15

[17]: data.drop("Kate",axis=0)

[17]: A B C D
Tim 4 5 6 7
Tom 8 9 10 11
Alex 12 13 14 15

[18]: data

[18]: A B C D
Kate 0 1 2 3
Tim 4 5 6 7
Tom 8 9 10 11
Alex 12 13 14 15

[19]: data.mean(axis="index")

[19]: A 6.0
B 7.0
C 8.0
D 9.0
dtype: float64

[20]: data.mean(axis="columns")

[20]: Kate 1.5


Tim 5.5
Tom 9.5
Alex 13.5
dtype: float64

from file: 07-Arithmetic Operations

7 Arithmetic Operations in Pandas


[1]: import pandas as pd
import numpy as np

[2]: s1=pd.Series(np.arange(4),
index=["a","c","d","e"])
s2=pd.Series(np.arange(5),
index=["a","c","e","f","g"])

[3]: print(s1)
print(s2)

a 0
c 1
d 2
e 3
dtype: int32
a 0
c 1
e 2
f 3
g 4
dtype: int32

[4]: s1+s2

[4]: a 0.0
c 2.0
d NaN
e 5.0
f NaN
g NaN
dtype: float64

[5]: df1=pd.DataFrame(
np.arange(6).reshape(2,3),
columns=list("ABC"),
index=["Tim","Tom"])
df2=pd.DataFrame(
np.arange(9).reshape(3,3),
columns=list("ACD"),
index=["Tim","Kate","Tom"])

[6]: print(df1)
print(df2)

A B C
Tim 0 1 2
Tom 3 4 5
A C D
Tim 0 1 2

Kate 3 4 5
Tom 6 7 8

[7]: df1+df2

[7]: A B C D
Kate NaN NaN NaN NaN
Tim 0.0 NaN 3.0 NaN
Tom 9.0 NaN 12.0 NaN

[8]: df1.add(df2,fill_value=0)

[8]: A B C D
Kate 3.0 NaN 4.0 5.0
Tim 0.0 1.0 3.0 2.0
Tom 9.0 4.0 12.0 8.0

[9]: 1/df1

[9]: A B C
Tim inf 1.00 0.5
Tom 0.333333 0.25 0.2

[10]: df1*3

[10]: A B C
Tim 0 3 6
Tom 9 12 15

[11]: df1.mul(3)

[11]: A B C
Tim 0 3 6
Tom 9 12 15

[12]: df2

[12]: A C D
Tim 0 1 2
Kate 3 4 5
Tom 6 7 8

[13]: s=df2.iloc[1]
s

[13]: A 3
C 4

D 5
Name: Kate, dtype: int32

[14]: df2-s

[14]: A C D
Tim -3 -3 -3
Kate 0 0 0
Tom 3 3 3

[15]: s2=df2["A"]
s2

[15]: Tim 0
Kate 3
Tom 6
Name: A, dtype: int32

[16]: df2.sub(s2,axis="index")

[16]: A C D
Tim 0 1 2
Kate 0 1 2
Tom 0 1 2

7.1 Applying a Function


[17]: df=pd.DataFrame(
np.random.randn(4,3),
columns=list("ABC"),
index=["Kim","Susan","Tim","Tom"])
df

[17]: A B C
Kim -0.173717 -1.126917 -0.595042
Susan -0.641672 -0.073913 -1.828588
Tim -0.389124 -1.786140 0.553646
Tom -0.062436 -0.251933 0.872391

[18]: np.abs(df)

[18]: A B C
Kim 0.173717 1.126917 0.595042
Susan 0.641672 0.073913 1.828588
Tim 0.389124 1.786140 0.553646
Tom 0.062436 0.251933 0.872391

[19]: f=lambda x:x.max()-x.min()

[20]: df.apply(f)

[20]: A 0.579236
B 1.712227
C 2.700979
dtype: float64

[21]: df.apply(f,axis=1)

[21]: Kim 0.953200


Susan 1.754676
Tim 2.339785
Tom 1.124324
dtype: float64

[22]: def f(x):


return x**2

[23]: df.apply(f)

[23]: A B C
Kim 0.030178 1.269942 0.354075
Susan 0.411743 0.005463 3.343735
Tim 0.151417 3.190294 0.306524
Tom 0.003898 0.063470 0.761066


from file: 08-Sorting and Ranking

8 Sorting & Ranking in Pandas


[1]: import pandas as pd
import numpy as np

[2]: s=pd.Series(range(5),
index=["e","d","a","b","c"])
s

[2]: e 0
d 1
a 2
b 3
c 4

dtype: int64

[3]: s.sort_index()

[3]: a 2
b 3
c 4
d 1
e 0
dtype: int64

[4]: df=pd.DataFrame(
np.arange(12).reshape(3,4),
index=["two","one","three"],
columns=["d","a","b","c"])
df

[4]: d a b c
two 0 1 2 3
one 4 5 6 7
three 8 9 10 11

[5]: df.sort_index()

[5]: d a b c
one 4 5 6 7
three 8 9 10 11
two 0 1 2 3

[6]: df.sort_index(axis=1)

[6]: a b c d
two 1 2 3 0
one 5 6 7 4
three 9 10 11 8

[7]: df.sort_index(axis=1, ascending=False)

[7]: d c b a
two 0 3 2 1
one 4 7 6 5
three 8 11 10 9

[8]: s2=pd.Series([5,np.nan,3,-1,9])
s2

[8]: 0 5.0
1 NaN
2 3.0
3 -1.0
4 9.0
dtype: float64

[9]: s2.sort_values()

[9]: 3 -1.0
2 3.0
0 5.0
4 9.0
1 NaN
dtype: float64

[10]: df2=pd.DataFrame(
{"a":[5,3,-1,9],"b":[1,-2,0,5]})
df2

[10]: a b
0 5 1
1 3 -2
2 -1 0
3 9 5

[11]: df2.sort_values(by="b")

[11]: a b
1 3 -2
2 -1 0
0 5 1
3 9 5

[12]: df2.sort_values(by=["b","a"])

[12]: a b
1 3 -2
2 -1 0
0 5 1
3 9 5
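The section title also mentions ranking, which the cells above do not show; a minimal sketch using the Series defined earlier:

s2.rank()   # average ranks, NaN stays NaN: -1.0 gets rank 1.0, 3.0 gets 2.0, 5.0 gets 3.0, 9.0 gets 4.0
pd.Series([7, 7, 5]).rank(method="min")   # ties share the lowest rank: 2.0, 2.0, 1.0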

8.1 Practice
Let’s practice using a real data set. You can download the data set from
https://www.kaggle.com/melodyxyz/global.

[13]: data=pd.read_csv("Data/vgsalesGlobale.csv")

[14]: data.head()

[14]: Rank Name Platform Year Genre Publisher \


0 1 Wii Sports Wii 2006.0 Sports Nintendo
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo

NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales


0 41.49 29.02 3.77 8.46 82.74
1 29.08 3.58 6.81 0.77 40.24
2 15.85 12.88 3.79 3.31 35.82
3 15.75 11.01 3.28 2.96 33.00
4 11.27 8.89 10.22 1.00 31.37

[15]: data["Name"].sort_values(ascending=False)

[15]: 9135     ¡Shin Chan Flipa en colores!
470      wwe Smackdown vs. Raw 2006
15523    uDraw Studio: Instant Artist
7835     uDraw Studio: Instant Artist
627      uDraw Studio
…
8304     .hack//G.U. Vol.3//Redemption
8602     .hack//G.U. Vol.2//Reminisce (jp sales)
7107     .hack//G.U. Vol.2//Reminisce
8357     .hack//G.U. Vol.1//Rebirth
4754     '98 Koshien
Name: Name, Length: 16598, dtype: object

[16]: data.sort_values("Name")

[16]: Rank Name Platform Year \


4754 4756 '98 Koshien PS 1998.0
8357 8359 .hack//G.U. Vol.1//Rebirth PS2 2006.0
7107 7109 .hack//G.U. Vol.2//Reminisce PS2 2006.0
8602 8604 .hack//G.U. Vol.2//Reminisce (jp sales) PS2 2006.0
8304 8306 .hack//G.U. Vol.3//Redemption PS2 2007.0
… … … … …
627 628 uDraw Studio Wii 2010.0
7835 7837 uDraw Studio: Instant Artist Wii 2011.0
15523 15526 uDraw Studio: Instant Artist X360 2011.0
470 471 wwe Smackdown vs. Raw 2006 PS2 NaN
9135 9137 ¡Shin Chan Flipa en colores! DS 2007.0

Genre Publisher NA_Sales EU_Sales JP_Sales \
4754 Sports Magical Company 0.15 0.10 0.12
8357 Role-Playing Namco Bandai Games 0.00 0.00 0.17
7107 Role-Playing Namco Bandai Games 0.11 0.09 0.00
8602 Role-Playing Namco Bandai Games 0.00 0.00 0.16
8304 Role-Playing Namco Bandai Games 0.00 0.00 0.17
… … … … … …
627 Misc THQ 1.67 0.58 0.00
7835 Misc THQ 0.08 0.09 0.00
15523 Misc THQ 0.01 0.01 0.00
470 Fighting NaN 1.57 1.02 0.00
9135 Platform 505 Games 0.00 0.00 0.14

Other_Sales Global_Sales
4754 0.03 0.41
8357 0.00 0.17
7107 0.03 0.23
8602 0.00 0.16
8304 0.00 0.17
… … …
627 0.20 2.46
7835 0.02 0.19
15523 0.00 0.02
470 0.41 3.00
9135 0.00 0.14

[16598 rows x 11 columns]

[17]: data.sort_values("Year")

[17]: Rank Name Platform Year Genre \


6896 6898 Checkers 2600 1980.0 Misc
2669 2671 Boxing 2600 1980.0 Fighting
5366 5368 Freeway 2600 1980.0 Action
1969 1971 Defender 2600 1980.0 Misc
1766 1768 Kaboom! 2600 1980.0 Misc
… … … … … …
16307 16310 Freaky Flyers GC NaN Racing
16327 16330 Inversion PC NaN Shooter
16366 16369 Hakuouki: Shinsengumi Kitan PS3 NaN Adventure
16427 16430 Virtua Quest GC NaN Role-Playing
16493 16496 The Smurfs 3DS NaN Action

Publisher NA_Sales EU_Sales JP_Sales Other_Sales \


6896 Atari 0.22 0.01 0.0 0.00
2669 Activision 0.72 0.04 0.0 0.01

5366 Activision 0.32 0.02 0.0 0.00
1969 Atari 0.99 0.05 0.0 0.01
1766 Activision 1.07 0.07 0.0 0.01
… … … … … …
16307 Unknown 0.01 0.00 0.0 0.00
16327 Namco Bandai Games 0.01 0.00 0.0 0.00
16366 Unknown 0.01 0.00 0.0 0.00
16427 Unknown 0.01 0.00 0.0 0.00
16493 Unknown 0.00 0.01 0.0 0.00

Global_Sales
6896 0.24
2669 0.77
5366 0.34
1969 1.05
1766 1.15
… …
16307 0.01
16327 0.01
16366 0.01
16427 0.01
16493 0.01

[16598 rows x 11 columns]

[18]: data.sort_values(["Year","Name"])

[18]: Rank Name Platform Year \


258 259 Asteroids 2600 1980.0
2669 2671 Boxing 2600 1980.0
6317 6319 Bridge 2600 1980.0
6896 6898 Checkers 2600 1980.0
1969 1971 Defender 2600 1980.0
… … … … …
7351 7353 Yu Yu Hakusho: Dark Tournament PS2 NaN
15476 15479 Yu-Gi-Oh! 5D's Wheelie Breakers (JP sales) Wii NaN
11409 11411 Zero: Tsukihami no Kamen Wii NaN
8899 8901 eJay Clubworld PS2 NaN
470 471 wwe Smackdown vs. Raw 2006 PS2 NaN

Genre Publisher NA_Sales EU_Sales JP_Sales \


258 Shooter Atari 4.00 0.26 0.00
2669 Fighting Activision 0.72 0.04 0.00
6317 Misc Activision 0.25 0.02 0.00
6896 Misc Atari 0.22 0.01 0.00
1969 Misc Atari 0.99 0.05 0.00
… … … … … …

7351 Fighting NaN 0.10 0.08 0.00
15476 Racing Unknown 0.00 0.00 0.02
11409 Action Nintendo 0.00 0.00 0.08
8899 Misc Empire Interactive 0.07 0.06 0.00
470 Fighting NaN 1.57 1.02 0.00

Other_Sales Global_Sales
258 0.05 4.31
2669 0.01 0.77
6317 0.00 0.27
6896 0.00 0.24
1969 0.01 1.05
… … …
7351 0.03 0.21
15476 0.00 0.02
11409 0.00 0.08
8899 0.02 0.15
470 0.41 3.00

[16598 rows x 11 columns]

from file: 09-Summary Statistics

9 Summarizing & Computing Descriptive Statistics


[1]: import pandas as pd
import numpy as np

[2]: df=pd.DataFrame(
[[2.4,np.nan],[6.3,-5.4],
[np.nan,np.nan],[0.75,-1.3]],
index=["a","b","c","d"],
columns=["one","two"])
df

[2]: one two


a 2.40 NaN
b 6.30 -5.4
c NaN NaN
d 0.75 -1.3

[3]: df.sum()

[3]: one 9.45


two -6.70
dtype: float64

[4]: df.sum(axis=1)

[4]: a 2.40
b 0.90
c 0.00
d -0.55
dtype: float64

[5]: df.mean(axis=1)

[5]: a 2.400
b 0.450
c NaN
d -0.275
dtype: float64

[6]: df.mean(axis=1,skipna=False)

[6]: a NaN
b 0.450
c NaN
d -0.275
dtype: float64

[7]: df.idxmax()

[7]: one b
two d
dtype: object

[8]: df.idxmin()

[8]: one d
two b
dtype: object

[9]: df.cumsum()

[9]: one two


a 2.40 NaN
b 8.70 -5.4
c NaN NaN
d 9.45 -6.7

[10]: df.describe()

[10]: one two
count 3.000 2.000000
mean 3.150 -3.350000
std 2.850 2.899138
min 0.750 -5.400000
25% 1.575 -4.375000
50% 2.400 -3.350000
75% 4.350 -2.325000
max 6.300 -1.300000

To find the correlation coefficient, let’s first import the famous iris data set. You can download
the iris data set from https://archive.ics.uci.edu/ml/datasets/iris.

[11]: iris=pd.read_csv("Data/iris.data",
sep=",",
header=None)

[12]: iris.head()

[12]: 0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

[13]: iris.columns=['sepal_length','sepal_width',
'petal_length','petal_width',
'class']

[14]: iris.head()

[14]: sepal_length sepal_width petal_length petal_width class


0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

[15]: iris["sepal_length"].corr(iris["sepal_width"])

[15]: -0.10936924995064938

[16]: iris.corr()

[16]: sepal_length sepal_width petal_length petal_width


sepal_length 1.000000 -0.109369 0.871754 0.817954

sepal_width -0.109369 1.000000 -0.420516 -0.356544
petal_length 0.871754 -0.420516 1.000000 0.962757
petal_width 0.817954 -0.356544 0.962757 1.000000

[17]: iris.cov()

[17]: sepal_length sepal_width petal_length petal_width


sepal_length 0.685694 -0.039268 1.273682 0.516904
sepal_width -0.039268 0.188004 -0.321713 -0.117981
petal_length 1.273682 -0.321713 3.113179 1.296387
petal_width 0.516904 -0.117981 1.296387 0.582414

[18]: iris.corrwith(iris.petal_length)

[18]: sepal_length 0.871754


sepal_width -0.420516
petal_length 1.000000
petal_width 0.962757
dtype: float64
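corr computes Pearson correlation by default; Kendall and Spearman variants are also available (a sketch, dropping the non-numeric class column first):

iris.drop(columns="class").corr(method="spearman")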

[19]: s=pd.Series(["b","b","b","b","c",
"c","a","a","a"])
s

[19]: 0 b
1 b
2 b
3 b
4 c
5 c
6 a
7 a
8 a
dtype: object

[20]: s.unique()

[20]: array(['b', 'c', 'a'], dtype=object)

[21]: s.value_counts()

[21]: b 4
a 3
c 2
dtype: int64

[22]: x=s.isin(["b","c"])
x

[22]: 0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool

[23]: s[x]

[23]: 0 b
1 b
2 b
3 b
4 c
5 c
dtype: object

from file: 10-Reading and Writing Data

10 Data Reading & Writing in Pandas


[1]: import pandas as pd
import numpy as np

[2]: df=pd.read_table("Data/data.txt")

[3]: df2=pd.read_table("Data/data2.txt")
df2

[3]: Tom,80,M
0 Tim,85,M
1 Kim,70,M
2 Kate,90,F
3 Alex,75,F

[4]: df2=pd.read_table("Data/data2.txt", sep=",")


df2

[4]: Tom 80 M
0 Tim 85 M
1 Kim 70 M
2 Kate 90 F
3 Alex 75 F

[5]: df=pd.read_table("Data/data2.txt", sep=",")


df

[5]: Tom 80 M
0 Tim 85 M
1 Kim 70 M
2 Kate 90 F
3 Alex 75 F

[6]: df=pd.read_table("Data/data2.txt",
sep=",",
header=None)
df

[6]: 0 1 2
0 Tom 80 M
1 Tim 85 M
2 Kim 70 M
3 Kate 90 F
4 Alex 75 F

[7]: df=pd.read_table("Data/data2.txt",
sep=",",
header=None,
names=["name","score","sex"])
df

[7]: name score sex


0 Tom 80 M
1 Tim 85 M
2 Kim 70 M
3 Kate 90 F
4 Alex 75 F

[8]: df=pd.read_table("Data/data2.txt",
sep=",",header=None,
names=["name","score","sex"],
index_col="name")
df

[8]: score sex
name
Tom 80 M
Tim 85 M
Kim 70 M
Kate 90 F
Alex 75 F

[9]: df2=pd.read_table("Data/data3.txt",
sep=",")
df2

[9]: lesson name one two


0 Math Kim 80 85
1 Math Tim 90 70
2 Math Tom 70 95
3 Stat Kate 65 90
4 Stat Alex 85 80
5 Stat Sam 55 70

[10]: df2=pd.read_table("Data/data3.txt",
sep=",",
index_col=["lesson","name"])
df2

[10]: one two


lesson name
Math Kim 80 85
Tim 90 70
Tom 70 95
Stat Kate 65 90
Alex 85 80
Sam 55 70

[11]: df3=pd.read_table("Data/data4.txt",sep=",")
df3

[11]: #hello
name score sex
#scores of students NaN NaN
Tim 80 M
Kate 85 F
Alex 70 M
Tom 90 M
Kim 75 F

[12]: df3=pd.read_table("Data/data4.txt",
sep=",",
skiprows=[0,2])
df3

[12]: name score sex


0 Tim 80 M
1 Kate 85 F
2 Alex 70 M
3 Tom 90 M
4 Kim 75 F

[13]: df3=pd.read_table("Data/data4.txt",
sep=",",
skiprows=[0,2],
usecols=[0,1])
df3

[13]: name score


0 Tim 80
1 Kate 85
2 Alex 70
3 Tom 90
4 Kim 75

[14]: df3=pd.read_table("Data/data4.txt",
sep=",",
skiprows=[0,2],
usecols=[0,1],
nrows=3)
df3

[14]: name score


0 Tim 80
1 Kate 85
2 Alex 70

10.1 Writing Data


[15]: df=pd.read_csv("Data/data.txt",sep="\t")
df

[15]: name score sex


0 Tim 80 M
1 Tom 85 M
2 Kim 70 F
3 Sam 90 M

4 Efe 75 M

[16]: df.to_csv("Data/new_data.csv")
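By default to_csv also writes the row index as the first column. A few commonly used options (a sketch; the file names are placeholders):

df.to_csv("Data/new_data.csv", index=False)   # omit the row index
df.to_csv("Data/new_data.txt", sep="\t")      # tab-separated output
df.to_csv("Data/new_data.csv", na_rep="NA")   # write missing values as "NA"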

from file: 11-Missing Data

11 Missing Data in Pandas


[1]: import pandas as pd
import numpy as np

[2]: s=pd.Series(["Sam",np.nan,"Tim","Kim"])
s

[2]: 0 Sam
1 NaN
2 Tim
3 Kim
dtype: object

[3]: s.isnull()

[3]: 0 False
1 True
2 False
3 False
dtype: bool

[4]: s.notnull()

[4]: 0 True
1 False
2 True
3 True
dtype: bool

[5]: s[3]=None
s.isnull()

[5]: 0 False
1 True
2 False
3 True
dtype: bool

[6]: s.dropna()

[6]: 0 Sam
2 Tim
dtype: object

[7]: from numpy import nan as NA

[8]: df=pd.DataFrame([[1,2,3],[4,NA,5],
[NA,NA,NA]])
df

[8]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[9]: df.dropna()

[9]: 0 1 2
0 1.0 2.0 3.0

[10]: df.dropna(how="all")

[10]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0

[11]: df

[11]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[12]: df[1]=NA
df

[12]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[13]: df.dropna(axis=1,how="all")

[13]: 0 2
0 1.0 3.0
1 4.0 5.0
2 NaN NaN

[14]: df

[14]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[15]: df.dropna(thresh=3)
df

[15]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[16]: df.fillna(0)

[16]: 0 1 2
0 1.0 0.0 3.0
1 4.0 0.0 5.0
2 0.0 0.0 0.0

[17]: df.fillna({0:15,1:25,2:35})

[17]: 0 1 2
0 1.0 25.0 3.0
1 4.0 25.0 5.0
2 15.0 25.0 35.0

[18]: df

[18]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[19]: df.fillna(0,inplace=True)
df

[19]: 0 1 2
0 1.0 0.0 3.0
1 4.0 0.0 5.0
2 0.0 0.0 0.0

[20]: df=pd.DataFrame([[1,2,3],[4,NA,5],
[NA,NA,NA]])
df

[20]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[21]: df.fillna(method="ffill")

[21]: 0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 5.0
2 4.0 2.0 5.0

[22]: df.fillna(method="ffill",limit=1)

[22]: 0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 5.0
2 4.0 NaN 5.0

[23]: data=pd.Series([1,0,NA,5])
data

[23]: 0 1.0
1 0.0
2 NaN
3 5.0
dtype: float64

[24]: data.fillna(data.mean())

[24]: 0 1.0
1 0.0
2 2.0
3 5.0
dtype: float64

[25]: df

[25]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN

[26]: df.fillna(df.mean())

[26]: 0 1 2
0 1.0 2.0 3.0

1 4.0 2.0 5.0
2 2.5 2.0 4.0
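On newer pandas versions, fillna(method=…) is deprecated in favor of dedicated methods; a sketch of the equivalents:

df.ffill()          # same as df.fillna(method="ffill")
df.bfill(limit=1)   # backward fill, at most one consecutive NaN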

from file: 12-Data Transformation

12 Data Transformation in Pandas


[1]: import pandas as pd

[2]: data=pd.DataFrame({"a":["one","two"]*3,
"b":[1,1,2,3,2,3]})
data

[2]: a b
0 one 1
1 two 1
2 one 2
3 two 3
4 one 2
5 two 3

[3]: data.duplicated()

[3]: 0 False
1 False
2 False
3 False
4 True
5 True
dtype: bool

[4]: data.drop_duplicates()

[4]: a b
0 one 1
1 two 1
2 one 2
3 two 3

[5]: data["c"]=range(6)
data

[5]: a b c
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3

4 one 2 4
5 two 3 5

[6]: data.duplicated(["a","b"],keep="last")

[6]: 0 False
1 False
2 True
3 True
4 False
5 False
dtype: bool

[7]: df=pd.DataFrame({"names":["Tim","tom","Sam",
"kate","Kim"],
"scores":[60,50,70,80,40]})
df

[7]: names scores


0 Tim 60
1 tom 50
2 Sam 70
3 kate 80
4 Kim 40

[8]: classes={"Tim":"A","Tom":"A","Sam":"B",
"Kate":"B","Kim":"B"}

[9]: n=df["names"].str.capitalize()

[10]: df["branches"]=n.map(classes)

[11]: df

[11]: names scores branches


0 Tim 60 A
1 tom 50 A
2 Sam 70 B
3 kate 80 B
4 Kim 40 B

[12]: s=pd.Series([80,70,90,60])
s

[12]: 0 80
1 70
2 90

3 60
dtype: int64

[13]: import numpy as np

[14]: s.replace(70,np.nan)

[14]: 0 80.0
1 NaN
2 90.0
3 60.0
dtype: float64

[15]: s.replace([70,60],[np.nan,0])

[15]: 0 80.0
1 NaN
2 90.0
3 0.0
dtype: float64

[16]: s.replace({90:100,60:0})

[16]: 0 80
1 70
2 100
3 0
dtype: int64

[17]: df=pd.DataFrame(
np.arange(12).reshape(3,4),
index=[0,1,2],
columns=["tim","tom","kim","sam"])
df

[17]: tim tom kim sam


0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

[18]: s=pd.Series(["one","two","three"])
df.index=df.index.map(s)

[19]: df

[19]: tim tom kim sam


one 0 1 2 3

two 4 5 6 7
three 8 9 10 11

[20]: df.rename(index=str.title,columns=str.upper)

[20]: TIM TOM KIM SAM


One 0 1 2 3
Two 4 5 6 7
Three 8 9 10 11

[21]: df.rename(index={"one":"ten"},
columns={"sam":"kate"},
inplace=True)
df

[21]: tim tom kim kate


ten 0 1 2 3
two 4 5 6 7
three 8 9 10 11

[22]: sc=[30,80,40,90,60,45,95,75,55,100,65,85]

[23]: x=[20,40,60,80,100]

[24]: y=pd.cut(sc,x)
y

[24]: [(20, 40], (60, 80], (20, 40], (80, 100], (40, 60], …, (60, 80], (40, 60],
(80, 100], (60, 80], (80, 100]]
Length: 12
Categories (4, interval[int64]): [(20, 40] < (40, 60] < (60, 80] < (80, 100]]

[25]: y.codes

[25]: array([0, 2, 0, 3, 1, 1, 3, 2, 1, 3, 2, 3], dtype=int8)

[26]: y.categories

[26]: IntervalIndex([(20, 40], (40, 60], (60, 80], (80, 100]],


closed='right',
dtype='interval[int64]')

[27]: pd.value_counts(y)

[27]: (80, 100] 4


(60, 80] 3
(40, 60] 3

(20, 40] 2
dtype: int64

[28]: y=pd.cut(sc,x,right=False)
y

[28]: [[20, 40), [80, 100), [40, 60), [80, 100), [60, 80), …, [60.0, 80.0), [40.0,
60.0), NaN, [60.0, 80.0), [80.0, 100.0)]
Length: 12
Categories (4, interval[int64]): [[20, 40) < [40, 60) < [60, 80) < [80, 100)]

[29]: nm=["low", "medium", "high", "very high"]


pd.cut(sc,x,labels=nm)

[29]: ['low', 'high', 'low', 'very high', 'medium', …, 'high', 'medium', 'very
high', 'high', 'very high']
Length: 12
Categories (4, object): ['low' < 'medium' < 'high' < 'very high']

[30]: pd.cut(sc,10)

[30]: [(29.93, 37.0], (79.0, 86.0], (37.0, 44.0], (86.0, 93.0], (58.0, 65.0], …,
(72.0, 79.0], (51.0, 58.0], (93.0, 100.0], (58.0, 65.0], (79.0, 86.0]]
Length: 12
Categories (10, interval[float64]): [(29.93, 37.0] < (37.0, 44.0] < (44.0, 51.0]
< (51.0, 58.0] … (72.0, 79.0] < (79.0, 86.0] < (86.0, 93.0] < (93.0, 100.0]]

[31]: data=np.random.randn(100)
c=pd.qcut(data,4)
c

[31]: [(-0.135, 0.631], (0.631, 2.286], (-0.629, -0.135], (0.631, 2.286], (-2.186,
-0.629], …, (-0.629, -0.135], (0.631, 2.286], (0.631, 2.286], (-0.135, 0.631],
(-0.629, -0.135]]
Length: 100
Categories (4, interval[float64]): [(-2.186, -0.629] < (-0.629, -0.135] <
(-0.135, 0.631] < (0.631, 2.286]]

[32]: pd.value_counts(c)

[32]: (0.631, 2.286] 25


(-0.135, 0.631] 25
(-0.629, -0.135] 25
(-2.186, -0.629] 25
dtype: int64

[33]: data=pd.DataFrame(np.random.randn(1000,4))
data.head()

[33]: 0 1 2 3
0 0.661162 -1.315550 0.138893 -2.186859
1 -0.422096 0.587658 -0.478577 -0.285737
2 -0.283092 -0.021623 1.194335 -0.197599
3 -1.545286 -0.219977 0.353704 0.424970
4 -0.196521 3.491917 0.016217 -0.464119

[34]: data.describe()

[34]: 0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.048690 0.030168 -0.050958 -0.004881
std 0.991747 0.996656 0.992911 1.021762
min -2.999425 -3.514219 -2.864107 -3.186480
25% -0.723203 -0.622904 -0.726046 -0.752119
50% -0.058605 0.027484 -0.036939 0.023319
75% 0.560961 0.654890 0.605512 0.739495
max 3.952527 3.491917 3.074154 3.002287

[35]: col=data[1]

[36]: col[np.abs(col)>3]

[36]: 4 3.491917
117 3.098920
311 -3.067095
903 -3.514219
Name: 1, dtype: float64

[37]: data[(np.abs(data)>3).any(1)]

[37]: 0 1 2 3
4 -0.196521 3.491917 0.016217 -0.464119
117 -1.360727 3.098920 -0.902404 -1.874759
311 -1.408380 -3.067095 -0.034621 -1.377992
492 -0.508447 -0.264550 -2.246989 -3.096799
510 3.952527 -0.307437 1.160524 1.022866
547 -0.146700 -0.594890 3.074154 -0.198825
642 3.116662 -0.466845 -0.543486 0.038652
867 0.222411 -0.131135 0.618451 3.002287
903 -0.917298 -3.514219 -2.500715 0.259489
956 0.115918 1.314808 -0.220663 -3.186480

[38]: np.sign(data).head()

[38]: 0 1 2 3
0 1.0 -1.0 1.0 -1.0
1 -1.0 1.0 -1.0 -1.0
2 -1.0 -1.0 1.0 -1.0
3 -1.0 -1.0 1.0 1.0
4 -1.0 1.0 1.0 -1.0

[39]: data=pd.DataFrame(
np.arange(12).reshape(4,3))
data

[39]: 0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11

[40]: rw=np.random.permutation(4)
rw

[40]: array([2, 3, 0, 1])

[41]: data.take(rw)

[41]: 0 1 2
2 6 7 8
3 9 10 11
0 0 1 2
1 3 4 5

[42]: data.sample()

[42]: 0 1 2
3 9 10 11

[43]: data.sample(n=2)

[43]: 0 1 2
0 0 1 2
2 6 7 8

12.1 Dummy Variable


[44]: data=pd.DataFrame(
{"letter":["c","b","a","b","b","a"],
"number":range(6)})
data

[44]: letter number
0 c 0
1 b 1
2 a 2
3 b 3
4 b 4
5 a 5

[45]: pd.get_dummies(data["letter"])

[45]: a b c
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 1 0
5 1 0 0

[46]: data=np.random.randn(10)
data

[46]: array([-0.47399816, -0.4700948 , -0.10090975, -0.49105714, 0.85748633,


-2.17384891, -1.89062041, 1.15155524, -1.12043372, 0.82935199])

[47]: pd.get_dummies(pd.cut(data,4))

[47]: (-2.177, -1.342] (-1.342, -0.511] (-0.511, 0.32] (0.32, 1.152]


0 0 0 1 0
1 0 0 1 0
2 0 0 1 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
6 1 0 0 0
7 0 0 0 1
8 0 1 0 0
9 0 0 0 1
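When several categorical columns are encoded at once, a prefix keeps the dummy columns distinguishable (a sketch with made-up values):

letters = pd.Series(["c", "b", "a", "b", "b", "a"])
pd.get_dummies(letters, prefix="letter")   # columns: letter_a, letter_b, letter_c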

from file: 13-Hierarchical Indexing

13 Hierarchical Indexing in Pandas


[1]: import numpy as np
import pandas as pd

[2]: data=pd.Series(np.random.randn(8),
index=[["a","a","a","b",
"b","b","c","c"],
[1,2,3,1,2,3,1,2]])
data

[2]: a 1 0.022235
2 0.007393
3 -3.081152
b 1 -0.673017
2 -0.034024
3 0.679701
c 1 1.175051
2 0.916181
dtype: float64

[3]: data.index

[3]: MultiIndex([('a', 1),


('a', 2),
('a', 3),
('b', 1),
('b', 2),
('b', 3),
('c', 1),
('c', 2)],
)

[4]: data["a"]

[4]: 1 0.022235
2 0.007393
3 -3.081152
dtype: float64

[5]: data["b":"c"]

[5]: b 1 -0.673017
2 -0.034024
3 0.679701
c 1 1.175051
2 0.916181
dtype: float64

[6]: data.loc[["a","c"]]

[6]: a 1 0.022235
2 0.007393
3 -3.081152
c 1 1.175051
2 0.916181
dtype: float64

[7]: data.loc[:,1]

[7]: a 0.022235
b -0.673017
c 1.175051
dtype: float64

[8]: data.unstack()

[8]: 1 2 3
a 0.022235 0.007393 -3.081152
b -0.673017 -0.034024 0.679701
c 1.175051 0.916181 NaN

[9]: data.unstack().stack()

[9]: a 1 0.022235
2 0.007393
3 -3.081152
b 1 -0.673017
2 -0.034024
3 0.679701
c 1 1.175051
2 0.916181
dtype: float64

[10]: df=pd.DataFrame(
np.arange(12).reshape(4,3),
index=[["a","a","b","b"],
[1,2,1,2]],
columns=[["num","num","ver"],
["math","stat","geo"]])
df

[10]: num ver


math stat geo
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

[11]: df.index.names=["class","exam"]
df.columns.names=["field","lesson"]
df

[11]: field num ver


lesson math stat geo
class exam
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

[12]: df["num"]

[12]: lesson math stat


class exam
a 1 0 1
2 3 4
b 1 6 7
2 9 10

[13]: df.swaplevel("class","exam")

[13]: field num ver


lesson math stat geo
exam class
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11

[14]: df.sort_index(level=1)

[14]: field num ver


lesson math stat geo
class exam
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11

13.1 Summary Statistics by Level


[15]: df.sum(level="exam")

[15]: field num ver


lesson math stat geo

exam
1 6 8 10
2 12 14 16

[16]: df.sum(level="field",axis=1)

[16]: field num ver


class exam
a 1 1 2
2 7 5
b 1 13 8
2 19 11
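Newer pandas versions deprecate the level= argument of the reduction methods; the groupby form is the replacement and gives the same results (a sketch):

df.groupby(level="exam").sum()            # same as df.sum(level="exam")
df.groupby(level="field", axis=1).sum()   # same as df.sum(level="field", axis=1)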

13.2 Indexing with a DataFrame’s columns


[17]: data=pd.DataFrame(
{"x":range(8),"y":range(8,0,-1),
"a":["one","one","one","one","two",
"two","two","two"],
"b":[0,1,2,3,0,1,2,3]})
data

[17]: x y a b
0 0 8 one 0
1 1 7 one 1
2 2 6 one 2
3 3 5 one 3
4 4 4 two 0
5 5 3 two 1
6 6 2 two 2
7 7 1 two 3

[18]: data2=data.set_index(["a","b"])
data2

[18]: x y
a b
one 0 0 8
1 1 7
2 2 6
3 3 5
two 0 4 4
1 5 3
2 6 2
3 7 1

[19]: data3=data.set_index(["a","b"],drop=False)
data3

[19]: x y a b
a b
one 0 0 8 one 0
1 1 7 one 1
2 2 6 one 2
3 3 5 one 3
two 0 4 4 two 0
1 5 3 two 1
2 6 2 two 2
3 7 1 two 3

[20]: data2

[20]: x y
a b
one 0 0 8
1 1 7
2 2 6
3 3 5
two 0 4 4
1 5 3
2 6 2
3 7 1

[21]: data2.reset_index()

[21]: a b x y
0 one 0 0 8
1 one 1 1 7
2 one 2 2 6
3 one 3 3 5
4 two 0 4 4
5 two 1 5 3
6 two 2 6 2
7 two 3 7 1

from file: 14-Combining and Merging Datasets

14 Combining & Merging Datasets in Pandas


[1]: import pandas as pd
import numpy as np

14.1 Joining DataFrame
[2]: d1=pd.DataFrame(
{"key":["a","b","c","c","d","e"],
"num1":range(6)})
d2=pd.DataFrame(
{"key":["b","c","e","f"],
"num2":range(4)})

[3]: print(d1)
print(d2)

key num1
0 a 0
1 b 1
2 c 2
3 c 3
4 d 4
5 e 5
key num2
0 b 0
1 c 1
2 e 2
3 f 3

[4]: pd.merge(d1, d2)

[4]: key num1 num2


0 b 1 0
1 c 2 1
2 c 3 1
3 e 5 2

[5]: pd.merge(d1, d2, on='key')

[5]: key num1 num2


0 b 1 0
1 c 2 1
2 c 3 1
3 e 5 2

[6]: d3=pd.DataFrame(
{"key1":["a","b","c","c","d","e"],
"num1":range(6)})
d4=pd.DataFrame(
{"key2":["b","c","e","f"],
"num2":range(4)})

[7]: pd.merge(
d3,d4,left_on="key1",right_on="key2"
)

[7]: key1 num1 key2 num2


0 b 1 b 0
1 c 2 c 1
2 c 3 c 1
3 e 5 e 2

[8]: pd.merge(d1,d2,how="outer")

[8]: key num1 num2


0 a 0.0 NaN
1 b 1.0 0.0
2 c 2.0 1.0
3 c 3.0 1.0
4 d 4.0 NaN
5 e 5.0 2.0
6 f NaN 3.0

[9]: pd.merge(d1,d2,how="left")

[9]: key num1 num2


0 a 0 NaN
1 b 1 0.0
2 c 2 1.0
3 c 3 1.0
4 d 4 NaN
5 e 5 2.0

[10]: pd.merge(d1,d2,how="right")

[10]: key num1 num2


0 b 1.0 0
1 c 2.0 1
2 c 3.0 1
3 e 5.0 2
4 f NaN 3

[11]: pd.merge(d1, d2, how='inner')

[11]: key num1 num2


0 b 1 0
1 c 2 1
2 c 3 1
3 e 5 2
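When keys can repeat (like “c” in d1 above), merge’s validate argument asserts the expected relationship instead of silently multiplying rows (a sketch):

pd.merge(d1, d2, on="key", validate="many_to_one")   # raises MergeError if d2 had duplicate keys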

[12]: df1=pd.DataFrame(
{"key":["a","b","c","c","d","e"],
"num1":range(6),
"count":["one","three","two",
"one","one","two"]})
df2=pd.DataFrame(
{"key":["b","c","e","f"],
"num2":range(4),
"count":["one","two","two","two"]})

[13]: pd.merge(df1, df2, on=['key', 'count'],


how='outer')

[13]: key num1 count num2


0 a 0.0 one NaN
1 b 1.0 three NaN
2 c 2.0 two 1.0
3 c 3.0 one NaN
4 d 4.0 one NaN
5 e 5.0 two 2.0
6 b NaN one 0.0
7 f NaN two 3.0

[14]: pd.merge(df1, df2, on="key", how='outer')

[14]: key num1 count_x num2 count_y


0 a 0.0 one NaN NaN
1 b 1.0 three 0.0 one
2 c 2.0 two 1.0 two
3 c 3.0 one 1.0 two
4 d 4.0 one NaN NaN
5 e 5.0 two 2.0 two
6 f NaN NaN 3.0 two

[15]: pd.merge(df1, df2,


on='key',
suffixes=('_data1', '_data2'))

[15]: key num1 count_data1 num2 count_data2


0 b 1 three 0 one
1 c 2 two 1 two
2 c 3 one 1 two
3 e 5 two 2 two

14.2 Merging on index
[16]: df1=pd.DataFrame(
{"letter":["a","a","b",
"b","a","c"],
"num":range(6)})
df2=pd.DataFrame(
{"value":[3,5,7]},
index=["a","b","e"])

[17]: print(df1)
print(df2)

letter num
0 a 0
1 a 1
2 b 2
3 b 3
4 a 4
5 c 5
value
a 3
b 5
e 7

[18]: pd.merge(df1,df2,
left_on="letter",
right_index=True)

[18]: letter num value


0 a 0 3
1 a 1 3
4 a 4 3
2 b 2 5
3 b 3 5

[19]: right=pd.DataFrame(
[[1,2],[3,4],[5,6]],
index=["a","c","d"],
columns=["Tom","Tim"])
left=pd.DataFrame(
[[7,8],[9,10],[11,12],[13,14]],
index=["a","b","e","f"],
columns=["Sam","Kim"])

[20]: pd.merge(right,left,
right_index=True,

left_index=True,
how="outer")

[20]: Tom Tim Sam Kim


a 1.0 2.0 7.0 8.0
b NaN NaN 9.0 10.0
c 3.0 4.0 NaN NaN
d 5.0 6.0 NaN NaN
e NaN NaN 11.0 12.0
f NaN NaN 13.0 14.0

[21]: left.join(right)

[21]: Sam Kim Tom Tim


a 7 8 1.0 2.0
b 9 10 NaN NaN
e 11 12 NaN NaN
f 13 14 NaN NaN

[22]: left.join(right,how="outer")

[22]: Sam Kim Tom Tim


a 7.0 8.0 1.0 2.0
b 9.0 10.0 NaN NaN
c NaN NaN 3.0 4.0
d NaN NaN 5.0 6.0
e 11.0 12.0 NaN NaN
f 13.0 14.0 NaN NaN

[23]: data=pd.DataFrame([[1,3],[5,7],[9,11]],
index=["a","b","f"],
columns=["Alex","Keta"])
left.join([right,data])

[23]: Sam Kim Tom Tim Alex Keta


a 7.0 8.0 1.0 2.0 1.0 3.0
b 9.0 10.0 NaN NaN 5.0 7.0
e 11.0 12.0 NaN NaN NaN NaN
f 13.0 14.0 NaN NaN 9.0 11.0

14.3 Concatenating Along an Axis


[24]: seq= np.arange(20).reshape((4, 5))

[25]: np.concatenate([seq,seq], axis=1)

[25]: array([[ 0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 10, 11, 12, 13, 14],
[15, 16, 17, 18, 19, 15, 16, 17, 18, 19]])

[26]: np.concatenate([seq, seq], axis=0)

[26]: array([[ 0, 1, 2, 3, 4],


[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])

[27]: data1 = pd.Series(


[0, 1], index=['a', 'b'])
data2 = pd.Series(
[2,3,4], index=['c','d','e'])
data3 = pd.Series(
[5, 6], index=['f', 'g'])

[28]: pd.concat([data1,data2,data3])

[28]: a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64

[29]: pd.concat([data1, data2, data3], axis=1)

[29]: 0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0

[30]: data4= pd.Series([10,11,12],


index=['a','b',"c"])

pd.concat([data1,data4],axis=1,join="inner")

[30]: 0 1
a 0 10
b 1 11

[31]: x=pd.concat([data1, data2, data4],


keys=['one', 'two','three'])
x

[31]: one a 0
b 1
two c 2
d 3
e 4
three a 10
b 11
c 12
dtype: int64

[32]: x=pd.concat([data1, data2, data4],


axis=1,
keys=['one', 'two', 'three'])
x

[32]: one two three


a 0.0 NaN 10.0
b 1.0 NaN 11.0
c NaN 2.0 12.0
d NaN 3.0 NaN
e NaN 4.0 NaN

[33]: df1 = pd.DataFrame(


np.arange(6).reshape(3, 2),
index=['a', 'b', 'c'],
columns=['one', 'two'])
df2 = pd.DataFrame(
10+np.arange(4).reshape(2,2),
index=['a', 'c'],
columns=['three', 'four'])

[34]: pd.concat([df1, df2], axis=1,


keys=['s1', 's2'],
sort=False)

[34]: s1 s2
one two three four

a 0 1 10.0 11.0
b 2 3 NaN NaN
c 4 5 12.0 13.0

[35]: data1 = pd.DataFrame(


np.random.randn(3, 4),
columns=['a','b','c','d'])
data2 = pd.DataFrame(
np.random.randn(2, 3),
columns=['b','d','a'])

[36]: pd.concat([data1, data2], ignore_index=True)

[36]: a b c d
0 0.443128 1.033878 -0.081062 0.720712
1 1.249823 1.695462 -1.911692 -2.135979
2 0.970119 0.152867 0.210750 0.736984
3 -0.930846 -1.478824 NaN 0.084256
4 -0.420467 1.158122 NaN 0.501372

from file: 15-Reshaping and Pivoting

15 Reshaping & Pivoting in Pandas


[1]: import pandas as pd
import numpy as np

[2]: data=pd.DataFrame(
np.arange(16).reshape(4,4),
index=[list("aabb"),[1,2]*2],
columns=[["num","num",
"comp","comp"],
["math","stat"]*2])
data

[2]: num comp


math stat math stat
a 1 0 1 2 3
2 4 5 6 7
b 1 8 9 10 11
2 12 13 14 15

[3]: data.index.names=["class","exam"]
data.columns.names=["field","lesson"]
data

[3]: field num comp
lesson math stat math stat
class exam
a 1 0 1 2 3
2 4 5 6 7
b 1 8 9 10 11
2 12 13 14 15

[4]: long=data.stack()

[5]: long

[5]: field comp num


class exam lesson
a 1 math 2 0
stat 3 1
2 math 6 4
stat 7 5
b 1 math 10 8
stat 11 9
2 math 14 12
stat 15 13

[6]: long.unstack()

[6]: field comp num


lesson math stat math stat
class exam
a 1 2 3 0 1
2 6 7 4 5
b 1 10 11 8 9
2 14 15 12 13

[7]: data.stack()

[7]: field comp num


class exam lesson
a 1 math 2 0
stat 3 1
2 math 6 4
stat 7 5
b 1 math 10 8
stat 11 9
2 math 14 12
stat 15 13

[8]: data.stack(0)

[8]: lesson math stat
class exam field
a 1 comp 2 3
num 0 1
2 comp 6 7
num 4 5
b 1 comp 10 11
num 8 9
2 comp 14 15
num 12 13

[9]: data.stack("field")

[9]: lesson math stat


class exam field
a 1 comp 2 3
num 0 1
2 comp 6 7
num 4 5
b 1 comp 10 11
num 8 9
2 comp 14 15
num 12 13

[10]: s1=pd.Series(
np.arange(4),index=list("abcd"))
s2=pd.Series(
np.arange(6,9),index=list("cde"))

[11]: print(s1)
print(s2)

a 0
b 1
c 2
d 3
dtype: int32
c 6
d 7
e 8
dtype: int32

[12]: data2=pd.concat([s1,s2],keys=["bir","iki"])
data2

[12]: bir a 0
b 1

c 2
d 3
iki c 6
d 7
e 8
dtype: int32

[13]: data2.unstack()

[13]: a b c d e
bir 0.0 1.0 2.0 3.0 NaN
iki NaN NaN 6.0 7.0 8.0

[14]: data2.unstack().stack(dropna=False)

[14]: bir a 0.0


b 1.0
c 2.0
d 3.0
e NaN
iki a NaN
b NaN
c 6.0
d 7.0
e 8.0
dtype: float64


15.1 Pivoting “Long” to “Wide” Format


[16]: stock=pd.DataFrame(
{"fruit": ["apple", "plum","grape"]*2,
"color": ["purple","yellow"]*3,
"piece":[3,4,5,6,1,2]})

[17]: stock

[17]: fruit color piece


0 apple purple 3
1 plum yellow 4
2 grape purple 5
3 apple yellow 6
4 plum purple 1
5 grape yellow 2

[18]: stock.pivot("fruit", "color", "piece")

[18]: color purple yellow


fruit
apple 3 6
grape 5 2
plum 1 4

[19]: stock["value"]=np.random.randn(len(stock))

[20]: stock

[20]: fruit color piece value


0 apple purple 3 -0.038716
1 plum yellow 4 -0.069972
2 grape purple 5 -1.116665
3 apple yellow 6 0.374715
4 plum purple 1 -0.023233
5 grape yellow 2 0.608953

[21]: p=stock.pivot("fruit","color")
p

[21]: piece value


color purple yellow purple yellow
fruit
apple 3 6 -0.038716 0.374715
grape 5 2 -1.116665 0.608953
plum 1 4 -0.023233 -0.069972

[22]: p["value"]

[22]: color purple yellow


fruit
apple -0.038716 0.374715
grape -1.116665 0.608953
plum -0.023233 -0.069972

15.2 Pivoting “Wide” to “Long” Format
[23]: data=pd.DataFrame(
{"lesson":["math","stat","bio"],
"Sam":[50,60,70],
"Kim":[80,70,90],
"Tom":[60,70,85]})
data

[23]: lesson Sam Kim Tom


0 math 50 80 60
1 stat 60 70 70
2 bio 70 90 85

[24]: group=pd.melt(data,["lesson"])

[25]: group

[25]: lesson variable value


0 math Sam 50
1 stat Sam 60
2 bio Sam 70
3 math Kim 80
4 stat Kim 70
5 bio Kim 90
6 math Tom 60
7 stat Tom 70
8 bio Tom 85

[26]: data=group.pivot(
"lesson","variable","value")
data

[26]: variable Kim Sam Tom


lesson
bio 90 70 85
math 80 50 60
stat 70 60 70

[27]: data.reset_index()

[27]: variable lesson Kim Sam Tom


0 bio 90 70 85
1 math 80 50 60
2 stat 70 60 70
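Note that pivot raises an error when an (index, columns) pair occurs more than once; pivot_table handles duplicates by aggregating them (a sketch on the melted frame above):

group.pivot_table(index="lesson", columns="variable", values="value", aggfunc="mean")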

from file: 16-GroupBy

16 What is Groupby in Pandas?
[1]: import pandas as pd
import numpy as np

[2]: df=pd.DataFrame(
{"key1":list("aabbab"),
"key2":["one","two","three"]*2,
"data1":np.random.randn(6),
"data2":np.random.randn(6)})
df

[2]: key1 key2 data1 data2


0 a one 0.128979 0.903436
1 a two -0.334460 -1.431566
2 b three -0.506455 -0.854207
3 b one 2.135132 -0.996191
4 a two -0.979153 1.918519
5 b three -0.165257 0.204901

[3]: group=df["data1"].groupby(df["key1"])

[4]: group

[4]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x000001EE83397370>

[5]: group.mean()

[5]: key1
a -0.394878
b 0.487807
Name: data1, dtype: float64

[6]: ave=df["data1"].groupby([df["key1"],
df["key2"]]).mean()
ave

[6]: key1 key2


a one 0.128979
two -0.656807
b one 2.135132
three -0.335856
Name: data1, dtype: float64

[7]: ave.unstack()

[7]: key2 one three two
key1
a 0.128979 NaN -0.656807
b 2.135132 -0.335856 NaN

[8]: df.groupby("key1").mean()

[8]: data1 data2


key1
a -0.394878 0.463463
b 0.487807 -0.548499

[9]: df.groupby(["key1","key2"]).mean()

[9]: data1 data2


key1 key2
a one 0.128979 0.903436
two -0.656807 0.243476
b one 2.135132 -0.996191
three -0.335856 -0.324653
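A quick way to see how many rows fall into each group (a sketch):

df.groupby("key1").size()   # a: 3, b: 3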

16.1 Iterating over Groups


[10]: for name, group in df.groupby("key1"):
print(name)
print(group)

a
key1 key2 data1 data2
0 a one 0.128979 0.903436
1 a two -0.334460 -1.431566
4 a two -0.979153 1.918519
b
key1 key2 data1 data2
2 b three -0.506455 -0.854207
3 b one 2.135132 -0.996191
5 b three -0.165257 0.204901

[11]: for (x1,x2),group in df.groupby(["key1",


"key2"]):
print(x1,x2)
print(group)

a one
key1 key2 data1 data2
0 a one 0.128979 0.903436
a two

key1 key2 data1 data2
1 a two -0.334460 -1.431566
4 a two -0.979153 1.918519
b one
key1 key2 data1 data2
3 b one 2.135132 -0.996191
b three
key1 key2 data1 data2
2 b three -0.506455 -0.854207
5 b three -0.165257 0.204901

[12]: piece=dict(list(df.groupby("key1")))

[13]: piece["a"]

[13]: key1 key2 data1 data2


0 a one 0.128979 0.903436
1 a two -0.334460 -1.431566
4 a two -0.979153 1.918519

16.2 Selecting a Column or Subset of Columns


[14]: df.groupby(['key1',
'key2'])[['data1']].mean()

[14]: data1
key1 key2
a one 0.128979
two -0.656807
b one 2.135132
three -0.335856

16.3 Grouping with Dicts and Series


[15]: fruit=pd.DataFrame(np.random.randn(4,4),
columns=list("abcd"),
index=["apple","cherry",
"banana","kiwi"])
fruit

[15]: a b c d
apple 0.803066 0.165556 0.040465 -0.376024
cherry -0.265198 0.778739 0.574622 -1.292316
banana -0.977442 -0.458472 -1.271370 0.614398
kiwi 0.580412 0.061148 1.257117 -1.351419

[16]: label={"a": "green","b":"yellow",
"c":"green","d":"yellow",
"e":"purple"}

[17]: group=fruit.groupby(label,axis=1)

[18]: group.sum()

[18]: green yellow


apple 0.843531 -0.210469
cherry 0.309424 -0.513577
banana -2.248812 0.155926
kiwi 1.837529 -1.290271

[19]: s=pd.Series(label)
s

[19]: a green
b yellow
c green
d yellow
e purple
dtype: object

[20]: fruit.groupby(s,axis=1).count()

[20]: green yellow


apple 2 2
cherry 2 2
banana 2 2
kiwi 2 2

16.4 Grouping with Functions


[21]: fruit.groupby(len).sum()

[21]: a b c d
4 0.580412 0.061148 1.257117 -1.351419
5 0.803066 0.165556 0.040465 -0.376024
6 -1.242640 0.320267 -0.696747 -0.677917

16.5 Grouping by Index Levels


[22]: data=pd.DataFrame(np.random.randn(4,5),
columns=[list("AAABB"),
[1,2,3,1,2]])

[23]: data.columns.names=["letter","number"]
data

[23]: letter A B
number 1 2 3 1 2
0 0.741181 -0.399735 0.562333 -1.035530 0.678250
1 0.394531 0.745952 -0.661248 0.811781 -0.804934
2 0.028793 -0.914979 0.857640 -0.780221 1.898880
3 -0.029662 0.092263 1.424289 -0.143006 1.484412

[24]: data.groupby(level="letter",axis=1).sum()

[24]: letter A B
0 0.903779 -0.357280
1 0.479235 0.006847
2 -0.028546 1.118659
3 1.486890 1.341406

16.6 Application with Real Data Set


[25]: game=pd.read_csv("Data/vgsalesGlobale.csv")

[26]: game.head()

[26]: Rank Name Platform Year Genre Publisher \


0 1 Wii Sports Wii 2006.0 Sports Nintendo
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo

NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales


0 41.49 29.02 3.77 8.46 82.74
1 29.08 3.58 6.81 0.77 40.24
2 15.85 12.88 3.79 3.31 35.82
3 15.75 11.01 3.28 2.96 33.00
4 11.27 8.89 10.22 1.00 31.37

[27]: game.dtypes

[27]: Rank int64


Name object
Platform object
Year float64
Genre object
Publisher object
NA_Sales float64

EU_Sales float64
JP_Sales float64
Other_Sales float64
Global_Sales float64
dtype: object

[28]: game.dropna().describe()

[28]: Rank Year NA_Sales EU_Sales JP_Sales \


count 16291.000000 16291.000000 16291.000000 16291.000000 16291.000000
mean 8290.190228 2006.405561 0.265647 0.147731 0.078833
std 4792.654450 5.832412 0.822432 0.509303 0.311879
min 1.000000 1980.000000 0.000000 0.000000 0.000000
25% 4132.500000 2003.000000 0.000000 0.000000 0.000000
50% 8292.000000 2007.000000 0.080000 0.020000 0.000000
75% 12439.500000 2010.000000 0.240000 0.110000 0.040000
max 16600.000000 2020.000000 41.490000 29.020000 10.220000

Other_Sales Global_Sales
count 16291.000000 16291.000000
mean 0.048426 0.540910
std 0.190083 1.567345
min 0.000000 0.010000
25% 0.000000 0.060000
50% 0.010000 0.170000
75% 0.040000 0.480000
max 10.570000 82.740000

[29]: game.Global_Sales.mean()

[29]: 0.5374406555006628

[30]: group=game.groupby("Genre")

[31]: group["Global_Sales"].count()

[31]: Genre
Action 3316
Adventure 1286
Fighting 848
Misc 1739
Platform 886
Puzzle 582
Racing 1249
Role-Playing 1488
Shooter 1310
Simulation 867

Sports 2346
Strategy 681
Name: Global_Sales, dtype: int64

[32]: group["Global_Sales"].describe()

[32]: count mean std min 25% 50% 75% max


Genre
Action 3316.0 0.528100 1.156427 0.01 0.07 0.190 0.5000 21.40
Adventure 1286.0 0.185879 0.513280 0.01 0.02 0.060 0.1600 11.18
Fighting 848.0 0.529375 0.955965 0.01 0.08 0.210 0.5500 13.04
Misc 1739.0 0.465762 1.314886 0.01 0.06 0.160 0.4100 29.02
Platform 886.0 0.938341 2.585254 0.01 0.09 0.280 0.7900 40.24
Puzzle 582.0 0.420876 1.561716 0.01 0.04 0.110 0.3075 30.26
Racing 1249.0 0.586101 1.662437 0.01 0.07 0.190 0.5300 35.82
Role-Playing 1488.0 0.623233 1.707909 0.01 0.07 0.185 0.5225 31.37
Shooter 1310.0 0.791885 1.817263 0.01 0.08 0.230 0.7275 28.31
Simulation 867.0 0.452364 1.195255 0.01 0.05 0.160 0.4200 24.76
Sports 2346.0 0.567319 2.089716 0.01 0.09 0.220 0.5600 82.74
Strategy 681.0 0.257151 0.520908 0.01 0.04 0.090 0.2700 5.45

[33]: game[game.Genre=="Action"].Global_Sales.mean()

[33]: 0.5281001206272617

[34]: group.mean()

[34]: Rank Year NA_Sales EU_Sales JP_Sales \


Genre
Action 7973.879071 2007.909929 0.264726 0.158323 0.048236
Adventure 11532.787714 2008.130878 0.082271 0.049868 0.040490
Fighting 7646.511792 2004.630383 0.263667 0.119481 0.103007
Misc 8561.847039 2007.258480 0.235906 0.124198 0.061967
Platform 6927.251693 2003.820776 0.504571 0.227573 0.147596
Puzzle 9627.381443 2005.243433 0.212680 0.087251 0.098471
Racing 7961.515612 2004.840131 0.287766 0.190865 0.045388
Role-Playing 8086.174731 2007.055744 0.219946 0.126384 0.236767
Shooter 7369.367939 2005.918877 0.444733 0.239137 0.029221
Simulation 8626.085352 2006.567568 0.211430 0.130773 0.073472
Sports 7425.026428 2005.477865 0.291283 0.160635 0.057702
Strategy 10071.897210 2005.599106 0.100881 0.066579 0.072628

Other_Sales Global_Sales
Genre
Action 0.056508 0.528100
Adventure 0.013072 0.185879
Fighting 0.043255 0.529375

Misc 0.043312 0.465762
Platform 0.058228 0.938341
Puzzle 0.021564 0.420876
Racing 0.061865 0.586101
Role-Playing 0.040060 0.623233
Shooter 0.078389 0.791885
Simulation 0.036355 0.452364
Sports 0.057532 0.567319
Strategy 0.016681 0.257151

[35]: %matplotlib inline

[36]: group["Global_Sales"].mean().plot(kind="bar")

[36]: <AxesSubplot:xlabel='Genre'>

[37]: group[["NA_Sales",
"EU_Sales",
"JP_Sales"]].mean().plot(kind="bar")

[37]: <AxesSubplot:xlabel='Genre'>

from file: 17-Working with GroupBy

17 Working with GroupBy


[1]: import pandas as pd
import numpy as np

[2]: df=pd.DataFrame({"key":list("ABC")*2,
"data1":range(6),
"data2":np.arange(5,11)})

[3]: df

[3]: key data1 data2


0 A 0 5
1 B 1 6
2 C 2 7
3 A 3 8
4 B 4 9
5 C 5 10

[4]: group=df.groupby("key")

[5]: group.aggregate(["min",np.median,"max"])

[5]: data1 data2


min median max min median max
key
A 0 1.5 3 5 6.5 8
B 1 2.5 4 6 7.5 9
C 2 3.5 5 7 8.5 10

[6]: group.agg({"data1":"min","data2":"max"})

[6]: data1 data2


key
A 0 8
B 1 9
C 2 10

[7]: def f(x):


return x.max()-x.min()

[8]: group.agg(f)

[8]: data1 data2


key
A 3 3
B 3 3
C 3 3
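A custom function passed to agg is applied column by column within each group, so f computes the peak-to-peak range of data1 and data2 separately. A minimal sketch of the same idea with explicitly named result columns (pd.NamedAgg requires pandas >= 0.25; the names range1/range2 are our own):

    # Same max-min aggregation, with named result columns
    group.agg(range1=pd.NamedAgg(column="data1", aggfunc=f),
              range2=pd.NamedAgg(column="data2", aggfunc=f))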

17.1 Applying more than one function


[9]: data=pd.DataFrame({"letter":list("ABC")*4,
"num":["one","two"]*6,
"d1":np.random.randn(12),
"d2":np.arange(10,33,2)})

[10]: data

[10]: letter num d1 d2


0 A one 0.988512 10
1 B two 1.318566 12
2 C one 0.606850 14
3 A two -1.029892 16
4 B one 0.232158 18
5 C two 0.380758 20
6 A one -1.202409 22

7 B two 0.209867 24
8 C one -1.164496 26
9 A two 0.194654 28
10 B one 0.160770 30
11 C two -0.301062 32

[11]: group=data.groupby(["letter","num"])

[12]: group_d1=group["d1"]

[13]: group_d1.agg("mean")

[13]: letter num


A one -0.106948
two -0.417619
B one 0.196464
two 0.764216
C one -0.278823
two 0.039848
Name: d1, dtype: float64

[14]: group_d1.agg(["mean","std",f])

[14]: mean std f


letter num
A one -0.106948 1.549215 2.190920
two -0.417619 0.865885 1.224546
B one 0.196464 0.050479 0.071388
two 0.764216 0.783969 1.108699
C one -0.278823 1.252531 1.771347
two 0.039848 0.482120 0.681820

[15]: group_d1.agg([("f_mean","mean"),
("f_std",np.std)])

[15]: f_mean f_std


letter num
A one -0.106948 1.549215
two -0.417619 0.865885
B one 0.196464 0.050479
two 0.764216 0.783969
C one -0.278823 1.252531
two 0.039848 0.482120

[16]: group.agg({"d1":["count","max","mean"],
"d2":"sum"})

[16]: d1 d2
count max mean sum
letter num
A one 2 0.988512 -0.106948 32
two 2 0.194654 -0.417619 44
B one 2 0.232158 0.196464 48
two 2 1.318566 0.764216 36
C one 2 0.606850 -0.278823 40
two 2 0.380758 0.039848 52

[17]: data.groupby(["letter","num"],
as_index=False).mean()

[17]: letter num d1 d2


0 A one -0.106948 16
1 A two -0.417619 22
2 B one 0.196464 24
3 B two 0.764216 18
4 C one -0.278823 20
5 C two 0.039848 26
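Passing as_index=False keeps the group keys as ordinary columns; for most purposes it is the same as computing the grouped mean and then calling reset_index():

    # Equivalent result: indexed groupby followed by reset_index()
    data.groupby(["letter","num"]).mean().reset_index()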

17.2 Split-Apply-Combine
[18]: data

[18]: letter num d1 d2


0 A one 0.988512 10
1 B two 1.318566 12
2 C one 0.606850 14
3 A two -1.029892 16
4 B one 0.232158 18
5 C two 0.380758 20
6 A one -1.202409 22
7 B two 0.209867 24
8 C one -1.164496 26
9 A two 0.194654 28
10 B one 0.160770 30
11 C two -0.301062 32

[19]: group=data.groupby("letter")

[20]: group["d2"].apply(lambda x:x.describe())

[20]: letter
A count 4.000000
mean 19.000000
std 7.745967

min 10.000000
25% 14.500000
50% 19.000000
75% 23.500000
max 28.000000
B count 4.000000
mean 21.000000
std 7.745967
min 12.000000
25% 16.500000
50% 21.000000
75% 25.500000
max 30.000000
C count 4.000000
mean 23.000000
std 7.745967
min 14.000000
25% 18.500000
50% 23.000000
75% 27.500000
max 32.000000
Name: d2, dtype: float64

[21]: math=pd.DataFrame({"Class":list("AB")*3,
"Stu":["Kim","Sam",
"Tim","Tom","John","Kate"],
"Score":[60,70,np.nan,
55,np.nan,80]})
math

[21]: Class Stu Score


0 A Kim 60.0
1 B Sam 70.0
2 A Tim NaN
3 B Tom 55.0
4 A John NaN
5 B Kate 80.0

[22]: group=math.groupby("Class")

[23]: group.mean()

[23]: Score
Class
A 60.000000
B 68.333333

[24]: func=lambda f:f.fillna(f.mean())

[25]: group.apply(func)

[25]: Class Stu Score


Class
A 0 A Kim 60.0
2 A Tim 60.0
4 A John 60.0
B 1 B Sam 70.0
3 B Tom 55.0
5 B Kate 80.0

[26]: value={"A":100,"B":50}

[27]: func1=lambda f:f.fillna(value[f.name])

[28]: group.apply(func1)

[28]: Class Stu Score


0 A Kim 60.0
1 B Sam 70.0
2 A Tim 100.0
3 B Tom 55.0
4 A John 100.0
5 B Kate 80.0
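A common alternative to apply for group-wise imputation is transform, which broadcasts the group statistic back to the original row order. A minimal sketch on the same math frame:

    # Fill each missing Score with its class mean, keeping the original index
    math["Score"]=math["Score"].fillna(
        math.groupby("Class")["Score"].transform("mean"))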

from file: 18-Pivot Tables

18 Pivot Tables
[1]: import pandas as pd
import numpy as np

[2]: df=pd.DataFrame(
{"class":list("ABC")*4,
"lesson":["math","stat"]*6,
"sex":list("MFMM")*3,
"sibling":[1,2,3]*4,
"score":np.arange(40,100,5)})

[3]: df

[3]: class lesson sex sibling score


0 A math M 1 40
1 B stat F 2 45
2 C math M 3 50

3 A stat M 1 55
4 B math M 2 60
5 C stat F 3 65
6 A math M 1 70
7 B stat M 2 75
8 C math M 3 80
9 A stat F 1 85
10 B math M 2 90
11 C stat M 3 95

[4]: df.groupby("lesson")["score"].mean()

[4]: lesson
math 65
stat 70
Name: score, dtype: int32

[5]: df.groupby(
["lesson",
"class"])[
"score"].aggregate("mean").unstack()

[5]: class A B C
lesson
math 55 75 65
stat 70 60 80

[6]: df.pivot_table(
"score",
index="lesson",
columns="class")

[6]: class A B C
lesson
math 55 75 65
stat 70 60 80

[7]: df.pivot_table(
["sibling","score"],
index=["class","lesson"],
columns="sex")

[7]: score sibling


sex F M F M
class lesson
A math NaN 55.0 NaN 1.0
stat 85.0 55.0 1.0 1.0

B math NaN 75.0 NaN 2.0
stat 45.0 75.0 2.0 2.0
C math NaN 65.0 NaN 3.0
stat 65.0 95.0 3.0 3.0

[8]: df.pivot_table(
["sibling","score"],
index=["class","lesson"],
columns="sex",margins=True)

[8]: score sibling


sex F M All F M All
class lesson
A math NaN 55.000000 55.0 NaN 1.0 1
stat 85.0 55.000000 70.0 1.0 1.0 1
B math NaN 75.000000 75.0 NaN 2.0 2
stat 45.0 75.000000 60.0 2.0 2.0 2
C math NaN 65.000000 65.0 NaN 3.0 3
stat 65.0 95.000000 80.0 3.0 3.0 3
All 65.0 68.333333 67.5 2.0 2.0 2

[9]: df.pivot_table(
["sibling","score"],
index=["class","lesson"],
columns="sex",fill_value=0)

[9]: score sibling


sex F M F M
class lesson
A math 0 55 0 1
stat 85 55 1 1
B math 0 75 0 2
stat 45 75 2 2
C math 0 65 0 3
stat 65 95 3 3

18.1 Multi-level pivot tables


[10]: sib=pd.cut(df["sibling"],[0,2,3])

[11]: df.pivot_table("score",
["lesson",sib],
"class",fill_value=0)

[11]: class A B C
lesson sibling
math (0, 2] 55 75 0

(2, 3] 0 0 65
stat (0, 2] 70 60 0
(2, 3] 0 0 80

[12]: df.pivot_table(
"score",
index="lesson",
columns="class")

[12]: class A B C
lesson
math 55 75 65
stat 70 60 80

[13]: df.pivot_table(
index="lesson",
columns="class",
aggfunc="sum")

[13]: score sibling


class A B C A B C
lesson
math 110 150 130 2 4 6
stat 140 120 160 2 4 6

[14]: df.pivot_table(
index="lesson",
columns="class",
aggfunc={"sibling":"max",
"score":"sum"})

[14]: score sibling


class A B C A B C
lesson
math 110 150 130 1 2 3
stat 140 120 160 1 2 3

18.2 Cross-Tabulations: Crosstab


[15]: pd.crosstab(df.sibling,df.lesson)

[15]: lesson math stat


sibling
1 2 2
2 2 2
3 2 2

[16]: pd.crosstab([df.sibling, df.lesson], df.sex)

[16]: sex F M
sibling lesson
1 math 0 2
stat 1 1
2 math 0 2
stat 1 1
3 math 0 2
stat 1 1

You can download the data set from https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv

[17]: births=pd.read_csv("Data/births.txt")

[18]: births.head()

[18]: year month day gender births


0 1969 1 1.0 F 4046
1 1969 1 1.0 M 4440
2 1969 1 2.0 F 4454
3 1969 1 2.0 M 4548
4 1969 1 3.0 F 4548

[19]: births["ten_year"]=10*(births["year"]//10)
births.pivot_table("births",
index="ten_year",
columns="gender",
aggfunc="sum")

[19]: gender F M
ten_year
1960 1753634 1846572
1970 16263075 17121550
1980 18310351 19243452
1990 19479454 20420553
2000 18229309 19106428

[20]: %matplotlib inline

[21]: import matplotlib.pyplot as plt
import seaborn as sns

[22]: sns.set() #For style

[23]: births.pivot_table("births",
index="year",
columns="gender",
aggfunc="sum").plot()
plt.ylabel("Annual total births")

[23]: Text(0, 0.5, 'Annual total births')

from file: 19-Categorical Data

19 Categorical Data in Pandas


19.1 How is a variable converted into a categorical structure?
[1]: import pandas as pd
import numpy as np

[2]: data=pd.Series(["Tim","Tom","Sam","Sam"]*3)
data

[2]: 0 Tim
1 Tom
2 Sam

3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
dtype: object

[3]: pd.unique(data)

[3]: array(['Tim', 'Tom', 'Sam'], dtype=object)

[4]: pd.value_counts(data)

[4]: Sam 6
Tom 3
Tim 3
dtype: int64

[5]: values=pd.Series([0,1,0,0]*3)

[6]: names=pd.Series(["Tim","Sam"])
names.take(values)

[6]: 0 Tim
1 Sam
0 Tim
0 Tim
0 Tim
1 Sam
0 Tim
0 Tim
0 Tim
1 Sam
0 Tim
0 Tim
dtype: object

19.2 Categorical Type in Pandas


[7]: data

[7]: 0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
dtype: object

[8]: N=len(data)

[9]: df=pd.DataFrame(
{"name":data,
"num":np.arange(N),
"score":np.random.randint(40,100,
size=N),
"weight":np.random.uniform(50,70,
size=N)},
columns=["num","name","score","weight"])

[10]: df

[10]: num name score weight


0 0 Tim 90 58.608318
1 1 Tom 99 67.616725
2 2 Sam 70 58.181046
3 3 Sam 96 56.833079
4 4 Tim 82 55.952711
5 5 Tom 89 52.296487
6 6 Sam 97 53.203579
7 7 Sam 96 63.967189
8 8 Tim 45 57.324508
9 9 Tom 57 58.393265
10 10 Sam 74 67.731591
11 11 Sam 80 55.372677

[11]: df["name"]

[11]: 0 Tim
1 Tom
2 Sam
3 Sam

4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: object

[12]: type(df["name"])

[12]: pandas.core.series.Series

[13]: name_cat=df["name"].astype("category")
name_cat

[13]: 0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']

[14]: x=name_cat.values

[15]: x.categories

[15]: Index(['Sam', 'Tim', 'Tom'], dtype='object')

[16]: x.codes

[16]: array([1, 2, 0, 0, 1, 2, 0, 0, 1, 2, 0, 0], dtype=int8)
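The codes and categories are enough to rebuild the original values, which is exactly what from_codes does:

    # Round trip: codes + categories -> the original categorical
    pd.Categorical.from_codes(x.codes, x.categories)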

[17]: df["name"]=df["name"].astype("category")
df.name

[17]: 0 Tim
1 Tom

2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']

[18]: data_cat=pd.Categorical(list("abcde"))
data_cat

[18]: ['a', 'b', 'c', 'd', 'e']


Categories (5, object): ['a', 'b', 'c', 'd', 'e']

[19]: pd.Categorical(["banana", "apple",
"kiwi", "banana", "apple"])

[19]: ['banana', 'apple', 'kiwi', 'banana', 'apple']


Categories (3, object): ['apple', 'banana', 'kiwi']

[20]: people=["baby", "child", "young", "old"]
codes=[0,1,2,3,1,0,0]
people_cat=pd.Categorical.from_codes(
codes,people)
people_cat

[20]: ['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']


Categories (4, object): ['baby', 'child', 'young', 'old']

[21]: people_cat=pd.Categorical.from_codes(
codes,people,ordered=True)
people_cat

[21]: ['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']


Categories (4, object): ['baby' < 'child' < 'young' < 'old']

[22]: people_cat.as_ordered()

[22]: ['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']


Categories (4, object): ['baby' < 'child' < 'young' < 'old']

19.3 Working with Categorical
[23]: data=np.random.randn(1000)

[24]: interval=pd.qcut(data,4)
interval

[24]: [(-2.9739999999999998, -0.668], (0.735, 3.402], (0.735, 3.402], (0.00973,


0.735], (-2.9739999999999998, -0.668], …, (-0.668, 0.00973], (0.735, 3.402],
(-0.668, 0.00973], (0.735, 3.402], (-2.9739999999999998, -0.668]]
Length: 1000
Categories (4, interval[float64]): [(-2.9739999999999998, -0.668] < (-0.668,
0.00973] < (0.00973, 0.735] < (0.735, 3.402]]

[25]: type(interval)

[25]: pandas.core.arrays.categorical.Categorical

[26]: interval=pd.qcut(data,4,labels=["Q1","Q2",
"Q3","Q4"])
interval

[26]: ['Q1', 'Q4', 'Q4', 'Q3', 'Q1', …, 'Q2', 'Q4', 'Q2', 'Q4', 'Q1']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

[27]: interval=pd.Series(interval,name="quarter")

[28]: pd.Series(
data).groupby(
interval).agg(["count",
"min",
"max"]).reset_index()

[28]: quarter count min max


0 Q1 250 -2.973073 -0.669205
1 Q2 250 -0.667556 0.009061
2 Q3 250 0.010393 0.733632
3 Q4 250 0.737847 3.402463

19.4 How do categorical types perform?


[29]: N=10000000
num=pd.Series(np.random.randn(N))

[30]: label=pd.Series(["a","b","c","d"]*(N//4))

[31]: cat=label.astype("category")

[32]: label.memory_usage()

[32]: 80000128

[33]: cat.memory_usage()

[33]: 10000320
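memory_usage counts only the container by default; passing deep=True also counts the underlying Python string objects, which makes the gap even larger. A quick sketch:

    # deep=True includes the heap cost of the Python strings
    label.memory_usage(deep=True), cat.memory_usage(deep=True)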

19.5 What are the categorical methods?


[34]: s=pd.Series(["a","b","c","d"]*2)

[35]: s_ct=s.astype("category")
s_ct

[35]: 0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

[36]: s_ct.cat.codes

[36]: 0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 3
dtype: int8

[37]: s_ct.cat.categories

[37]: Index(['a', 'b', 'c', 'd'], dtype='object')

[38]: new_ct=["a","b","c","d","e"]
s_ct.cat.set_categories(new_ct)

[38]: 0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

[39]: s2_ct=s_ct[s_ct.isin(["a","b"])]
s2_ct

[39]: 0 a
1 b
4 a
5 b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

[40]: s2_ct.cat.remove_unused_categories()

[40]: 0 a
1 b
4 a
5 b
dtype: category
Categories (2, object): ['a', 'b']

19.6 How to create dummy variables?


[41]: s_ct

[41]: 0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

[42]: pd.get_dummies(s_ct)

[42]: a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
6 0 0 1 0
7 0 0 0 1
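For modeling it is common to drop one dummy per variable to avoid perfect collinearity; get_dummies supports this directly:

    # Drop the first level ('a') and prefix the remaining columns
    pd.get_dummies(s_ct, prefix="cat", drop_first=True)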

from file: 20-Working with Text Data

20 Working with Text Data


[1]: "hello".upper()

[1]: 'HELLO'

20.1 Vectorized String Functions


[2]: import pandas as pd
import numpy as np

[3]: data=["tim","Kate","SUSan",np.nan,"aLEX"]

[4]: name=pd.Series(data)

[5]: name.str.capitalize()

[5]: 0 Tim
1 Kate
2 Susan
3 NaN
4 Alex
dtype: object

[6]: name.str.lower()

[6]: 0 tim
1 kate
2 susan
3 NaN
4 alex
dtype: object

[7]: name.str.len()

[7]: 0 3.0
1 4.0
2 5.0
3 NaN
4 4.0
dtype: float64

[8]: name.str.startswith("a")

[8]: 0 False
1 False
2 False
3 NaN
4 True
dtype: object

[9]: df=pd.DataFrame(
np.random.randn(3,2),
columns=["Column A","Column B"],
index=range(3))
df

[9]: Column A Column B


0 -0.459978 0.200495
1 0.739367 -2.557691
2 0.371356 0.086189

[10]: df.columns

[10]: Index(['Column A', 'Column B'], dtype='object')

[11]: df.columns.str.lower().str.replace(" ","_")

[11]: Index(['column_a', 'column_b'], dtype='object')

[12]: s=pd.Series(["a_b_c","c_d_e",np.nan,"f_g_h"])
s

[12]: 0 a_b_c
1 c_d_e
2 NaN
3 f_g_h
dtype: object

[13]: s.str.split("_").str[1]

[13]: 0 b
1 d
2 NaN
3 g
dtype: object

[14]: s.str.split("_",expand=True,n=1)

[14]: 0 1
0 a b_c
1 c d_e
2 NaN NaN
3 f g_h

[15]: money=pd.Series(["15","-$20","$30000"])
money

[15]: 0 15
1 -$20
2 $30000
dtype: object

[16]: money.str.replace(r"-\$","")

[16]: 0 15
1 20
2 $30000
dtype: object

[17]: money.str.replace(r"-\$","-")

[17]: 0 15
1 -20
2 $30000
dtype: object
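Note that str.replace treated the pattern as a regular expression here (the pandas 1.x default); in newer pandas versions the default is regex=False, so it is safer to be explicit:

    # Be explicit that the pattern is a regex (the default changed in pandas 2.0)
    money.str.replace(r"-\$", "-", regex=True)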

You can use Google or the Pandas documentation at pandas.pydata.org to look up the available string methods.
[18]: film=pd.read_csv("http://bit.ly/imdbratings")

[19]: film.head()

[19]: star_rating title content_rating genre duration \


0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200
3 9.0 The Dark Knight PG-13 Action 152

4 8.9 Pulp Fiction R Crime 154

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…

[20]: film.title.str.upper()

[20]: 0 THE SHAWSHANK REDEMPTION


1 THE GODFATHER
2 THE GODFATHER: PART II
3 THE DARK KNIGHT
4 PULP FICTION

974 TOOTSIE
975 BACK TO THE FUTURE PART III
976 MASTER AND COMMANDER: THE FAR SIDE OF THE WORLD
977 POLTERGEIST
978 WALL STREET
Name: title, Length: 979, dtype: object

[21]: film.columns=film.columns.str.capitalize()

[22]: film.head()

[22]: Star_rating Title Content_rating Genre Duration \


0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200
3 9.0 The Dark Knight PG-13 Action 152
4 8.9 Pulp Fiction R Crime 154

Actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…

[23]: film[film.Actors_list.str.contains(
"Brad Pitt")]

[23]: Star_rating Title \


9 8.9 Fight Club

24 8.7 Se7en
106 8.3 Snatch.
114 8.3 Inglourious Basterds
264 8.1 Twelve Monkeys
508 7.8 The Curious Case of Benjamin Button
577 7.8 Ocean's Eleven
683 7.7 Fury
776 7.6 Moneyball
779 7.6 Interview with the Vampire: The Vampire Chroni…
807 7.6 The Assassination of Jesse James by the Coward…
826 7.5 Sleepers
877 7.5 Legends of the Fall
901 7.5 Babel

Content_rating Genre Duration \


9 R Drama 139
24 R Drama 127
106 R Comedy 102
114 R Adventure 153
264 R Mystery 129
508 PG-13 Drama 166
577 PG-13 Crime 116
683 R Action 134
776 PG-13 Biography 133
779 R Horror 123
807 R Biography 160
826 R Crime 147
877 R Drama 133
901 R Drama 143

Actors_list
9 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh…
24 [u'Morgan Freeman', u'Brad Pitt', u'Kevin Spac…
106 [u'Jason Statham', u'Brad Pitt', u'Benicio Del…
114 [u'Brad Pitt', u'Diane Kruger', u'Eli Roth']
264 [u'Bruce Willis', u'Madeleine Stowe', u'Brad P…
508 [u'Brad Pitt', u'Cate Blanchett', u'Tilda Swin…
577 [u'George Clooney', u'Brad Pitt', u'Julia Robe…
683 [u'Brad Pitt', u'Shia LaBeouf', u'Logan Lerman']
776 [u'Brad Pitt', u'Robin Wright', u'Jonah Hill']
779 [u'Brad Pitt', u'Tom Cruise', u'Antonio Bander…
807 [u'Brad Pitt', u'Casey Affleck', u'Sam Shepard']
826 [u'Robert De Niro', u'Kevin Bacon', u'Brad Pitt']
877 [u'Brad Pitt', u'Anthony Hopkins', u'Aidan Qui…
901 [u'Brad Pitt', u'Cate Blanchett', u'Gael Garc\…

[24]: film.Actors_list.str.replace("[","")

[24]: 0 u'Tim Robbins', u'Morgan Freeman', u'Bob Gunton']
1 u'Marlon Brando', u'Al Pacino', u'James Caan']
2 u'Al Pacino', u'Robert De Niro', u'Robert Duva…
3 u'Christian Bale', u'Heath Ledger', u'Aaron Ec…
4 u'John Travolta', u'Uma Thurman', u'Samuel L. …

974 u'Dustin Hoffman', u'Jessica Lange', u'Teri Ga…
975 u'Michael J. Fox', u'Christopher Lloyd', u'Mar…
976 u'Russell Crowe', u'Paul Bettany', u'Billy Boyd']
977 u'JoBeth Williams', u"Heather O'Rourke", u'Cra…
978 u'Charlie Sheen', u'Michael Douglas', u'Tamara…
Name: Actors_list, Length: 979, dtype: object

[25]: film.Actors_list.str.replace(
"[","").str.replace("]","")

[25]: 0 u'Tim Robbins', u'Morgan Freeman', u'Bob Gunton'


1 u'Marlon Brando', u'Al Pacino', u'James Caan'
2 u'Al Pacino', u'Robert De Niro', u'Robert Duvall'
3 u'Christian Bale', u'Heath Ledger', u'Aaron Ec…
4 u'John Travolta', u'Uma Thurman', u'Samuel L. …

974 u'Dustin Hoffman', u'Jessica Lange', u'Teri Garr'
975 u'Michael J. Fox', u'Christopher Lloyd', u'Mar…
976 u'Russell Crowe', u'Paul Bettany', u'Billy Boyd'
977 u'JoBeth Williams', u"Heather O'Rourke", u'Cra…
978 u'Charlie Sheen', u'Michael Douglas', u'Tamara…
Name: Actors_list, Length: 979, dtype: object
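"[" on its own is not a valid regular expression, so on recent pandas versions this call raises an error; passing regex=False replaces the literal bracket instead:

    # Literal (non-regex) replacement of both brackets
    film.Actors_list.str.replace("[","",regex=False).str.replace("]","",regex=False)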

from file: 21-Practical Data Analysis with Pandas

21 Practical Data Analysis with Pandas


[1]: import pandas as pd

If you want, you can download the data set from https://openpolicing.stanford.edu/data/, where you can also find more information about it. Let's import the data set.
[2]: df=pd.read_csv(
"Data/ca_san_diego_2019_02_25.csv")

[3]: df.head()

[3]: raw_row_number date time service_area subject_age \


0 1 2014-01-01 01:25:00 110 24.0
1 2 2014-01-01 05:47:00 320 42.0
2 3 2014-01-01 07:46:00 320 29.0

3 4 2014-01-01 08:10:00 610 23.0
4 5 2014-01-01 08:35:00 930 35.0

subject_race subject_sex type arrest_made citation_issued \


0 white male vehicular False True
1 white male vehicular False False
2 asian/pacific islander male vehicular False False
3 white male vehicular False True
4 hispanic male vehicular False True

warning_issued outcome contraband_found search_conducted search_person \


0 False citation NaN False NaN
1 True warning NaN False NaN
2 True warning NaN False NaN
3 False citation NaN False NaN
4 False citation NaN False NaN

search_vehicle search_basis reason_for_search reason_for_stop


0 NaN NaN NaN Moving Violation
1 NaN NaN NaN Moving Violation
2 NaN NaN NaN Moving Violation
3 NaN NaN NaN Moving Violation
4 NaN NaN NaN Equipment Violation

[4]: df.tail(3)

[4]: raw_row_number date time service_area subject_age \


390996 390997 2017-03-31 23:49:00 620 23.0
390997 390998 2017-03-31 23:55:00 710 NaN
390998 390999 2017-03-31 23:58:00 310 26.0

subject_race subject_sex type arrest_made citation_issued \


390996 hispanic male vehicular False False
390997 hispanic male vehicular NaN True
390998 white female vehicular NaN False

warning_issued outcome contraband_found search_conducted \


390996 False NaN NaN False
390997 False citation NaN NaN
390998 True warning NaN NaN

search_person search_vehicle search_basis reason_for_search \


390996 NaN NaN NaN NaN
390997 NaN NaN NaN NaN
390998 NaN NaN NaN NaN

reason_for_stop

390996 Radio Call/Citizen Contact
390997 Moving Violation
390998 Equipment Violation

[5]: df.shape

[5]: (390999, 19)

[6]: df.dtypes

[6]: raw_row_number int64


date object
time object
service_area object
subject_age float64
subject_race object
subject_sex object
type object
arrest_made object
citation_issued object
warning_issued object
outcome object
contraband_found object
search_conducted object
search_person object
search_vehicle object
search_basis object
reason_for_search object
reason_for_stop object
dtype: object

[7]: df.isnull().sum()

[7]: raw_row_number 0
date 132
time 1256
service_area 0
subject_age 12644
subject_race 1398
subject_sex 806
type 0
arrest_made 35022
citation_issued 32712
warning_issued 32712
outcome 40047
contraband_found 379835
search_conducted 37096

search_person 376459
search_vehicle 376459
search_basis 374173
reason_for_search 376343
reason_for_stop 266
dtype: int64

[8]: df.date.head()

[8]: 0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2014-01-01
Name: date, dtype: object

[9]: df.columns

[9]: Index(['raw_row_number', 'date', 'time', 'service_area', 'subject_age',


'subject_race', 'subject_sex', 'type', 'arrest_made', 'citation_issued',
'warning_issued', 'outcome', 'contraband_found', 'search_conducted',
'search_person', 'search_vehicle', 'search_basis', 'reason_for_search',
'reason_for_stop'],
dtype='object')

[10]: df["time"].head()

[10]: 0 01:25:00
1 05:47:00
2 07:46:00
3 08:10:00
4 08:35:00
Name: time, dtype: object

[11]: df[["date","time"]].head()

[11]: date time


0 2014-01-01 01:25:00
1 2014-01-01 05:47:00
2 2014-01-01 07:46:00
3 2014-01-01 08:10:00
4 2014-01-01 08:35:00

[12]: df.rename(columns={"date":"DATE",
"time":"TIME"},
inplace=True)
df.head()

[12]: raw_row_number DATE TIME service_area subject_age \
0 1 2014-01-01 01:25:00 110 24.0
1 2 2014-01-01 05:47:00 320 42.0
2 3 2014-01-01 07:46:00 320 29.0
3 4 2014-01-01 08:10:00 610 23.0
4 5 2014-01-01 08:35:00 930 35.0

subject_race subject_sex type arrest_made citation_issued \


0 white male vehicular False True
1 white male vehicular False False
2 asian/pacific islander male vehicular False False
3 white male vehicular False True
4 hispanic male vehicular False True

warning_issued outcome contraband_found search_conducted search_person \


0 False citation NaN False NaN
1 True warning NaN False NaN
2 True warning NaN False NaN
3 False citation NaN False NaN
4 False citation NaN False NaN

search_vehicle search_basis reason_for_search reason_for_stop


0 NaN NaN NaN Moving Violation
1 NaN NaN NaN Moving Violation
2 NaN NaN NaN Moving Violation
3 NaN NaN NaN Moving Violation
4 NaN NaN NaN Equipment Violation

[13]: df.iloc[0].head()

[13]: raw_row_number 1
DATE 2014-01-01
TIME 01:25:00
service_area 110
subject_age 24
Name: 0, dtype: object

[14]: df.iloc[0,1]

[14]: '2014-01-01'

[15]: df.iloc[0,[1,3,5]]

[15]: DATE 2014-01-01


service_area 110
subject_race white
Name: 0, dtype: object

[16]: df.iloc[0:5,[1,3,5]]

[16]: DATE service_area subject_race


0 2014-01-01 110 white
1 2014-01-01 320 white
2 2014-01-01 320 asian/pacific islander
3 2014-01-01 610 white
4 2014-01-01 930 hispanic

[17]: df.iloc[0:5,0:5]

[17]: raw_row_number DATE TIME service_area subject_age


0 1 2014-01-01 01:25:00 110 24.0
1 2 2014-01-01 05:47:00 320 42.0
2 3 2014-01-01 07:46:00 320 29.0
3 4 2014-01-01 08:10:00 610 23.0
4 5 2014-01-01 08:35:00 930 35.0

[18]: df.loc[1:5,"TIME":"type"]

[18]: TIME service_area subject_age subject_race subject_sex \


1 05:47:00 320 42.0 white male
2 07:46:00 320 29.0 asian/pacific islander male
3 08:10:00 610 23.0 white male
4 08:35:00 930 35.0 hispanic male
5 08:39:00 820 30.0 hispanic male

type
1 vehicular
2 vehicular
3 vehicular
4 vehicular
5 vehicular

[19]: df.loc[0:5,"DATE":"type"].head()

[19]: DATE TIME service_area subject_age subject_race \


0 2014-01-01 01:25:00 110 24.0 white
1 2014-01-01 05:47:00 320 42.0 white
2 2014-01-01 07:46:00 320 29.0 asian/pacific islander
3 2014-01-01 08:10:00 610 23.0 white
4 2014-01-01 08:35:00 930 35.0 hispanic

subject_sex type
0 male vehicular
1 male vehicular
2 male vehicular

3 male vehicular
4 male vehicular

[20]: df.shape

[20]: (390999, 19)

[21]: df.dropna(axis="columns",how="all").shape

[21]: (390999, 19)

[22]: df.dropna(axis="columns",how="any").shape

[22]: (390999, 3)
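dropna can also keep columns by a minimum count of non-missing values via thresh; a sketch (the 90% cutoff is an arbitrary choice for illustration):

    # Keep only columns that are at least 90% populated
    df.dropna(axis="columns", thresh=int(len(df)*0.9)).shape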

21.1 A Simple Analysis


[23]: df.reason_for_stop.value_counts().head()

[23]: Moving Violation 285562


Equipment Violation 99577
Radio Call/Citizen Contact 1941
Muni, County, H&S Code 1349
Personal Knowledge/Informant 884
Name: reason_for_stop, dtype: int64

[24]: df[df.reason_for_stop==
"Moving Violation"
].subject_sex.value_counts(normalize=True)

[24]: male 0.628904


female 0.371096
Name: subject_sex, dtype: float64


[26]: df[df.subject_sex==
"female"
].reason_for_stop.value_counts(
normalize=True).head()

[26]: Moving Violation 0.773651
Equipment Violation 0.215736
Radio Call/Citizen Contact 0.003606
Muni, County, H&S Code 0.002516
Personal Knowledge/Informant 0.001609
Name: reason_for_stop, dtype: float64

[27]: df.groupby(
"subject_sex"
).reason_for_stop.value_counts(
normalize=True).head()

[27]: subject_sex reason_for_stop


female Moving Violation 0.773651
Equipment Violation 0.215736
Radio Call/Citizen Contact 0.003606
Muni, County, H&S Code 0.002516
Personal Knowledge/Informant 0.001609
Name: reason_for_stop, dtype: float64

[28]: df.groupby(
"subject_sex"
).reason_for_stop.value_counts(
normalize=True).unstack()

[28]: reason_for_stop &Equipment Violation &Moving Violation \


subject_sex
female 0.000015 0.000044
male 0.000004 0.000059

reason_for_stop &Radio Call/Citizen Contact B & P Equipment Violation \


subject_sex
female NaN 0.000007 0.215736
male 0.000004 NaN 0.276510

reason_for_stop MUNI, County, H&S Code Moving Violation \


subject_sex
female 0.000029 0.773651
male 0.000099 0.707770

reason_for_stop Muni, County, H&S Code NOT CHECKED NOT MARKED … \


subject_sex …
female 0.002516 0.000007 0.000015 …
male 0.003961 NaN NaN …

reason_for_stop Radio Call/Citizen Contact Suspect Info \


subject_sex

female 0.003606 0.000044
male 0.005690 0.000043

reason_for_stop Suspect Info (I.S., Bulletin, Log) UNI, &County, H&&S Code \
subject_sex
female 0.000929 0.000110
male 0.001619 0.000229

reason_for_stop none listed not listed not marked not marked not marked \
subject_sex
female 0.000007 NaN 0.000015 0.000007
male 0.000016 0.000004 0.000008 NaN

reason_for_stop not noted not secified


subject_sex
female NaN 0.000007
male 0.000008 NaN

[2 rows x 26 columns]

[29]: df.arrest_made.value_counts()

[29]: False 351060


True 4917
Name: arrest_made, dtype: int64

[30]: df.arrest_made.value_counts(normalize=True)

[30]: False 0.986187


True 0.013813
Name: arrest_made, dtype: float64

[31]: df.groupby(
"subject_sex"
).arrest_made.value_counts(normalize=True)

[31]: subject_sex arrest_made


female False 0.991049
True 0.008951
male False 0.983550
True 0.016450
Name: arrest_made, dtype: float64

[32]: df.groupby(
["subject_race","subject_sex"]
).arrest_made.value_counts(
normalize=True).head()

[32]: subject_race subject_sex arrest_made
asian/pacific islander female False 0.993134
True 0.006866
male False 0.987704
True 0.012296
black female False 0.985657
Name: arrest_made, dtype: float64

[33]: df.DATE.str.slice(0,4).value_counts()

[33]: 2014 144164


2015 115422
2016 103051
2017 28230
Name: DATE, dtype: int64

[34]: combined=df.DATE.str.cat(df.TIME, sep=" ")

[35]: df["stop_datetime"]=pd.to_datetime(combined)
df.head()

[35]: raw_row_number DATE TIME service_area subject_age \


0 1 2014-01-01 01:25:00 110 24.0
1 2 2014-01-01 05:47:00 320 42.0
2 3 2014-01-01 07:46:00 320 29.0
3 4 2014-01-01 08:10:00 610 23.0
4 5 2014-01-01 08:35:00 930 35.0

subject_race subject_sex type arrest_made citation_issued \


0 white male vehicular False True
1 white male vehicular False False
2 asian/pacific islander male vehicular False False
3 white male vehicular False True
4 hispanic male vehicular False True

warning_issued outcome contraband_found search_conducted search_person \


0 False citation NaN False NaN
1 True warning NaN False NaN
2 True warning NaN False NaN
3 False citation NaN False NaN
4 False citation NaN False NaN

search_vehicle search_basis reason_for_search reason_for_stop \


0 NaN NaN NaN Moving Violation
1 NaN NaN NaN Moving Violation
2 NaN NaN NaN Moving Violation
3 NaN NaN NaN Moving Violation

4 NaN NaN NaN Equipment Violation

stop_datetime
0 2014-01-01 01:25:00
1 2014-01-01 05:47:00
2 2014-01-01 07:46:00
3 2014-01-01 08:10:00
4 2014-01-01 08:35:00

[36]: df.dtypes

[36]: raw_row_number int64


DATE object
TIME object
service_area object
subject_age float64
subject_race object
subject_sex object
type object
arrest_made object
citation_issued object
warning_issued object
outcome object
contraband_found object
search_conducted object
search_person object
search_vehicle object
search_basis object
reason_for_search object
reason_for_stop object
stop_datetime datetime64[ns]
dtype: object

[37]: df.stop_datetime.dt.month.head()

[37]: 0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
Name: stop_datetime, dtype: float64

[38]: df.arrest_made.head()

[38]: 0 False
1 False
2 False

3 False
4 False
Name: arrest_made, dtype: object

[39]: df["arrest_made"]=df.arrest_made.astype(bool)

[40]: df.arrest_made.value_counts()

[40]: False 351060


True 39939
Name: arrest_made, dtype: int64

[41]: df.arrest_made.mean()

[41]: 0.10214604129422326
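Be careful here: astype(bool) maps NaN to True, which is why the True count jumped from 4,917 to 39,939 and this mean is inflated. A safer sketch would fill the missing values before casting (re-reading the raw column, since df.arrest_made has already been overwritten):

    # Treat missing arrest records as False before the boolean cast
    raw=pd.read_csv("Data/ca_san_diego_2019_02_25.csv",
                    usecols=["arrest_made"]).arrest_made
    raw.fillna(False).astype(bool).mean()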

[42]: df.groupby(
df.stop_datetime.dt.hour
).arrest_made.mean().head()

[42]: stop_datetime
0.0 0.113073
1.0 0.140684
2.0 0.141698
3.0 0.127541
4.0 0.105165
Name: arrest_made, dtype: float64

[43]: %matplotlib inline

[44]: df.groupby(
df.stop_datetime.dt.hour
).arrest_made.mean().plot()

[44]: <AxesSubplot:xlabel='stop_datetime'>

[45]: df.stop_datetime.dt.hour.value_counts().head()

[45]: 10.0 30590


9.0 29751
8.0 27865
15.0 24598
0.0 22481
Name: stop_datetime, dtype: int64

[46]: df.stop_datetime.dt.hour.value_counts().sort_index().head()

[46]: 0.0 22481


1.0 8544
2.0 5914
3.0 3591
4.0 3214
Name: stop_datetime, dtype: int64

[47]: df.stop_datetime.dt.hour.value_counts().sort_index().plot()

[47]: <AxesSubplot:>

from file: 22-Multiple Selecting-Filtering

22 Multiple Selecting & Filtering in Pandas


[1]: import pandas as pd

[2]: film=pd.read_csv("http://bit.ly/imdbratings")

[3]: film.head(3)

[3]: star_rating title content_rating genre duration \


0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…

22.1 Practical column selection


[4]: film["title"].head()

[4]: 0 The Shawshank Redemption
1 The Godfather
2 The Godfather: Part II
3 The Dark Knight
4 Pulp Fiction
Name: title, dtype: object

[5]: film[["title","genre"]].head()

[5]: title genre


0 The Shawshank Redemption Crime
1 The Godfather Crime
2 The Godfather: Part II Crime
3 The Dark Knight Action
4 Pulp Fiction Crime

22.2 loc method


[6]: film.loc[0,]

[6]: star_rating 9.3


title The Shawshank Redemption
content_rating R
genre Crime
duration 142
actors_list [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
Name: 0, dtype: object

[7]: film.loc[[0,2,4],]

[7]: star_rating title content_rating genre duration \


0 9.3 The Shawshank Redemption R Crime 142
2 9.1 The Godfather: Part II R Crime 200
4 8.9 Pulp Fiction R Crime 154

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…

[8]: film.loc[0:2,]

[8]: star_rating title content_rating genre duration \


0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…

[9]: film.loc[0:5,"title"]

[9]: 0 The Shawshank Redemption


1 The Godfather
2 The Godfather: Part II
3 The Dark Knight
4 Pulp Fiction
5 12 Angry Men
Name: title, dtype: object

[10]: film.loc[0:5,"title":"genre"]

[10]: title content_rating genre


0 The Shawshank Redemption R Crime
1 The Godfather R Crime
2 The Godfather: Part II R Crime
3 The Dark Knight PG-13 Action
4 Pulp Fiction R Crime
5 12 Angry Men NOT RATED Drama

[11]: film.loc[0:5,"title":"duration"]

[11]: title content_rating genre duration


0 The Shawshank Redemption R Crime 142
1 The Godfather R Crime 175
2 The Godfather: Part II R Crime 200
3 The Dark Knight PG-13 Action 152
4 Pulp Fiction R Crime 154
5 12 Angry Men NOT RATED Drama 96

[12]: film.loc[:,"title":"genre"].head()

[12]: title content_rating genre


0 The Shawshank Redemption R Crime
1 The Godfather R Crime
2 The Godfather: Part II R Crime
3 The Dark Knight PG-13 Action
4 Pulp Fiction R Crime

[13]: film.loc[film.genre=="Crime",].head(3)

[13]: star_rating title content_rating genre duration \
0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…

[14]: film.loc[film.genre=="Crime",[
"title","duration"]]

[14]: title duration


0 The Shawshank Redemption 142
1 The Godfather 175
2 The Godfather: Part II 200
4 Pulp Fiction 154
21 City of God 130
.. … …
927 Brick 110
931 Mean Streets 112
950 Bound 108
969 Law Abiding Citizen 109
978 Wall Street 126

[124 rows x 2 columns]

[15]: film.loc[
film.genre=="Crime","title":"duration"]

[15]: title content_rating genre duration


0 The Shawshank Redemption R Crime 142
1 The Godfather R Crime 175
2 The Godfather: Part II R Crime 200
4 Pulp Fiction R Crime 154
21 City of God R Crime 130
.. … … … …
927 Brick R Crime 110
931 Mean Streets R Crime 112
950 Bound R Crime 108
969 Law Abiding Citizen R Crime 109
978 Wall Street R Crime 126

[124 rows x 4 columns]

22.3 iloc method
[16]: film.iloc[:,0].head()

[16]: 0 9.3
1 9.2
2 9.1
3 9.0
4 8.9
Name: star_rating, dtype: float64

[17]: film.columns

[17]: Index(['star_rating', 'title', 'content_rating', 'genre', 'duration',


'actors_list'],
dtype='object')

[18]: film.iloc[:,[0,3]].head()

[18]: star_rating genre


0 9.3 Crime
1 9.2 Crime
2 9.1 Crime
3 9.0 Action
4 8.9 Crime

[19]: film.iloc[:,0:3].head()

[19]: star_rating title content_rating


0 9.3 The Shawshank Redemption R
1 9.2 The Godfather R
2 9.1 The Godfather: Part II R
3 9.0 The Dark Knight PG-13
4 8.9 Pulp Fiction R

[20]: film.iloc[0,0:3]

[20]: star_rating 9.3


title The Shawshank Redemption
content_rating R
Name: 0, dtype: object

[21]: film.iloc[0:5,0:3]

[21]: star_rating title content_rating


0 9.3 The Shawshank Redemption R
1 9.2 The Godfather R
2 9.1 The Godfather: Part II R

3 9.0 The Dark Knight PG-13
4 8.9 Pulp Fiction R

[22]: film.iloc[0:5,].head(3)

[22]: star_rating title content_rating genre duration \


0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…

[23]: film.iloc[0:5,:].head(3)

[23]: star_rating title content_rating genre duration \


0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…

22.4 Multiple filtering
[24]: film.loc[film.duration>=200,].head(3)

[24]: star_rating title content_rating \


2 9.1 The Godfather: Part II R
7 8.9 The Lord of the Rings: The Return of the King PG-13
17 8.7 Seven Samurai UNRATED

genre duration actors_list


2 Crime 200 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
7 Adventure 201 [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK…
17 Drama 207 [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K…

[25]: film.loc[film.duration>=200,"title"].head()

[25]: 2 The Godfather: Part II


7 The Lord of the Rings: The Return of the King
17 Seven Samurai
78 Once Upon a Time in America

85 Lawrence of Arabia
Name: title, dtype: object

[26]: film[(
film.duration>=200)|(
film.genre=="Crime")|(
film.genre=="Action")].head(3)

[26]: star_rating title content_rating genre duration \


0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…

[27]: film[film.genre.isin([
"Crime","Drama","Action"])]

[27]: star_rating title \


0 9.3 The Shawshank Redemption
1 9.2 The Godfather
2 9.1 The Godfather: Part II
3 9.0 The Dark Knight
4 8.9 Pulp Fiction
.. … …
970 7.4 Wonder Boys
972 7.4 Blue Valentine
973 7.4 The Cider House Rules
976 7.4 Master and Commander: The Far Side of the World
978 7.4 Wall Street

content_rating genre duration \


0 R Crime 142
1 R Crime 175
2 R Crime 200
3 PG-13 Action 152
4 R Crime 154
.. … … …
970 R Drama 107
972 NC-17 Drama 112
973 PG-13 Drama 126
976 PG-13 Action 138
978 R Crime 126

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…
.. …
970 [u'Michael Douglas', u'Tobey Maguire', u'Franc…
972 [u'Ryan Gosling', u'Michelle Williams', u'John…
973 [u'Tobey Maguire', u'Charlize Theron', u'Micha…
976 [u'Russell Crowe', u'Paul Bettany', u'Billy Bo…
978 [u'Charlie Sheen', u'Michael Douglas', u'Tamar…

[538 rows x 6 columns]

from file: 23-Time Series Basics with Pandas

23 Time Series Basics with Pandas


23.1 What is a time series?
A time series is a sequence of observations indexed by time; in pandas it is typically a Series or DataFrame whose index is a DatetimeIndex.
[1]: import pandas as pd
import numpy as np
from datetime import datetime

[2]: date=[datetime(2020,1,5),
datetime(2020,1,10),
datetime(2020,1,15),
datetime(2020,1,20),
datetime(2020,1,25)]

[3]: ts=pd.Series(np.random.randn(5),index=date)
ts

[3]: 2020-01-05 1.021772


2020-01-10 0.132016
2020-01-15 -0.838591
2020-01-20 -0.749657
2020-01-25 1.043829
dtype: float64

[4]: ts.index

[4]: DatetimeIndex(['2020-01-05', '2020-01-10', '2020-01-15', '2020-01-20',


'2020-01-25'],
dtype='datetime64[ns]', freq=None)

23.2 Time Series Data Structures
[5]: pd.to_datetime("01/01/2020")

[5]: Timestamp('2020-01-01 00:00:00')

[6]: dates=pd.to_datetime(
[datetime(2020,7,5),
"6th of July, 2020",
"2020-Jul-7",
"20200708"])
dates

[6]: DatetimeIndex(['2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08'],


dtype='datetime64[ns]', freq=None)

[7]: dates.to_period("D")

[7]: PeriodIndex(['2020-07-05', '2020-07-06', '2020-07-07', '2020-07-08'],


dtype='period[D]', freq='D')

[8]: dates-dates[0]

[8]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days'],


dtype='timedelta64[ns]', freq=None)

23.3 Creating a Time Series


[9]: pd.date_range("2020-08-15","2020-09-01")

[9]: DatetimeIndex(['2020-08-15', '2020-08-16', '2020-08-17', '2020-08-18',


'2020-08-19', '2020-08-20', '2020-08-21', '2020-08-22',
'2020-08-23', '2020-08-24', '2020-08-25', '2020-08-26',
'2020-08-27', '2020-08-28', '2020-08-29', '2020-08-30',
'2020-08-31', '2020-09-01'],
dtype='datetime64[ns]', freq='D')

[10]: pd.date_range('2020-07-15', periods=10)

[10]: DatetimeIndex(['2020-07-15', '2020-07-16', '2020-07-17', '2020-07-18',


'2020-07-19', '2020-07-20', '2020-07-21', '2020-07-22',
'2020-07-23', '2020-07-24'],
dtype='datetime64[ns]', freq='D')

[11]: pd.date_range("2020-07-15",
periods=10,
freq="H")

[11]: DatetimeIndex(['2020-07-15 00:00:00', '2020-07-15 01:00:00',
'2020-07-15 02:00:00', '2020-07-15 03:00:00',
'2020-07-15 04:00:00', '2020-07-15 05:00:00',
'2020-07-15 06:00:00', '2020-07-15 07:00:00',
'2020-07-15 08:00:00', '2020-07-15 09:00:00'],
dtype='datetime64[ns]', freq='H')

[12]: pd.period_range("2020-10",
periods=10,
freq="M")

[12]: PeriodIndex(['2020-10', '2020-11', '2020-12', '2021-01', '2021-02', '2021-03',


'2021-04', '2021-05', '2021-06', '2021-07'],
dtype='period[M]', freq='M')

[13]: pd.timedelta_range(0,periods=8,freq="H")

[13]: TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
'0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
'0 days 06:00:00', '0 days 07:00:00'],
dtype='timedelta64[ns]', freq='H')

[14]: stamp=ts.index[1]
stamp

[14]: Timestamp('2020-01-10 00:00:00')

[15]: ts[stamp]

[15]: 0.1320158576081964

[16]: ts["25.1.2020"]

[16]: 1.0438287426871447

[17]: ts["20200125"]

[17]: 1.0438287426871447

[18]: long_ts=pd.Series(
np.random.randn(1000),
index=pd.date_range("1/1/2020",
periods=1000))
long_ts.head()

[18]: 2020-01-01 -0.563091


2020-01-02 -0.696792

2020-01-03 -1.171112
2020-01-04 0.140425
2020-01-05 1.861661
Freq: D, dtype: float64

[19]: long_ts["2020"].head()

[19]: 2020-01-01 -0.563091


2020-01-02 -0.696792
2020-01-03 -1.171112
2020-01-04 0.140425
2020-01-05 1.861661
Freq: D, dtype: float64

[20]: long_ts["2020-10"].head(15)

[20]: 2020-10-01 -0.074736


2020-10-02 -0.376618
2020-10-03 0.846641
2020-10-04 0.625867
2020-10-05 -0.723827
2020-10-06 0.588691
2020-10-07 -0.105803
2020-10-08 -2.083312
2020-10-09 1.587779
2020-10-10 -1.055614
2020-10-11 -1.249746
2020-10-12 -0.501513
2020-10-13 -1.527727
2020-10-14 0.757313
2020-10-15 1.649080
Freq: D, dtype: float64

[21]: long_ts[datetime(2022,9,20):]

[21]: 2022-09-20 -0.544946


2022-09-21 0.724151
2022-09-22 -0.733824
2022-09-23 0.809910
2022-09-24 -1.342181
2022-09-25 0.311458
2022-09-26 -1.724693
Freq: D, dtype: float64

23.4 The Important Methods Used in Time Series
[22]: ts

[22]: 2020-01-05 1.021772


2020-01-10 0.132016
2020-01-15 -0.838591
2020-01-20 -0.749657
2020-01-25 1.043829
dtype: float64

[23]: ts.truncate(after="1/15/2020")

[23]: 2020-01-05 1.021772


2020-01-10 0.132016
2020-01-15 -0.838591
dtype: float64

[24]: date=pd.date_range("1/1/2020",
periods=100,
freq="W-SUN")

[25]: long_df=pd.DataFrame(np.random.randn(100,4),
index=date,
columns=list("ABCD"))
long_df.head()

[25]: A B C D
2020-01-05 0.298570 0.541989 0.270855 0.812892
2020-01-12 0.559454 0.052274 -0.129981 -1.355922
2020-01-19 -0.936940 -1.359541 0.060583 0.055678
2020-01-26 1.413215 -1.267353 0.260763 1.158904
2020-02-02 -0.480138 1.104503 0.925559 -0.183879

[26]: long_df["2020-10"]

[26]: A B C D
2020-10-04 0.273074 0.634981 -0.887838 0.148878
2020-10-11 0.108231 -0.053346 -0.520075 -1.125796
2020-10-18 -0.757139 -0.287251 -1.123849 1.163562
2020-10-25 0.550777 0.295018 0.622508 -0.551081

[27]: date=pd.DatetimeIndex(
["1/1/2020","1/2/2020","1/2/2020",
"1/2/2020","1/3/2020"])
ts1=pd.Series(np.arange(5),index=date)
ts1

[27]: 2020-01-01 0
2020-01-02 1
2020-01-02 2
2020-01-02 3
2020-01-03 4
dtype: int32

[28]: ts1.index.is_unique

[28]: False

[29]: group=ts1.groupby(level=0)

[30]: group.count()

[30]: 2020-01-01 1
2020-01-02 3
2020-01-03 1
dtype: int64

[31]: group.mean()

[31]: 2020-01-01 0
2020-01-02 2
2020-01-03 4
dtype: int32
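When the index has duplicates, another option is to keep one row per timestamp instead of aggregating:

    # Keep the first observation for each duplicated timestamp
    ts1[~ts1.index.duplicated(keep="first")]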

from file: 24-Working with Methods in Pandas - Part 1

24 Working with Methods in Pandas - Part 1


[1]: import pandas as pd ; import numpy as np

[2]: date=pd.date_range(
start="2018",end="2019", freq="BM")

[3]: ts=pd.Series(
np.random.randn(len(date)),index=date)
ts

[3]: 2018-01-31 -1.977003


2018-02-28 -0.339459
2018-03-30 -0.587687
2018-04-30 1.141997
2018-05-31 -0.125199
2018-06-29 -1.090406
2018-07-31 -0.435640

2018-08-31 0.181651
2018-09-28 -2.518869
2018-10-31 1.428868
2018-11-30 -0.357551
2018-12-31 0.612771
Freq: BM, dtype: float64

[4]: ts.index

[4]: DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-30', '2018-04-30',


'2018-05-31', '2018-06-29', '2018-07-31', '2018-08-31',
'2018-09-28', '2018-10-31', '2018-11-30', '2018-12-31'],
dtype='datetime64[ns]', freq='BM')

[5]: ts[:5].index

[5]: DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-30', '2018-04-30',


'2018-05-31'],
dtype='datetime64[ns]', freq='BM')

24.1 Reading a Time Series Data Set


[6]: fb=pd.read_csv("FB.csv")

[7]: fb.head()

[7]: Date Open High Low Close Adj Close \


0 2018-07-30 175.300003 175.300003 166.559998 171.059998 171.059998
1 2018-07-31 170.669998 174.240005 170.000000 172.580002 172.580002
2 2018-08-01 173.929993 175.080002 170.899994 171.649994 171.649994
3 2018-08-02 170.679993 176.789993 170.270004 176.369995 176.369995
4 2018-08-03 177.690002 178.850006 176.149994 177.779999 177.779999

Volume
0 65280800
1 40356500
2 34042100
3 32400000
4 24763400

24.2 Converting date to index


[8]: fb.dtypes

[8]: Date object


Open float64

High float64
Low float64
Close float64
Adj Close float64
Volume int64
dtype: object

[9]: fb=pd.read_csv(
"FB.csv", parse_dates=["Date"])

[10]: fb=pd.read_csv(
"FB.csv",
parse_dates=["Date"],
index_col="Date")

[11]: fb.index

[11]: DatetimeIndex(['2018-07-30', '2018-07-31', '2018-08-01', '2018-08-02',


'2018-08-03', '2018-08-06', '2018-08-07', '2018-08-08',
'2018-08-09', '2018-08-10',

'2019-07-16', '2019-07-17', '2019-07-18', '2019-07-19',
'2019-07-22', '2019-07-23', '2019-07-24', '2019-07-25',
'2019-07-26', '2019-07-29'],
dtype='datetime64[ns]', name='Date', length=251, freq=None)

[12]: fb.head()

[12]: Open High Low Close Adj Close \


Date
2018-07-30 175.300003 175.300003 166.559998 171.059998 171.059998
2018-07-31 170.669998 174.240005 170.000000 172.580002 172.580002
2018-08-01 173.929993 175.080002 170.899994 171.649994 171.649994
2018-08-02 170.679993 176.789993 170.270004 176.369995 176.369995
2018-08-03 177.690002 178.850006 176.149994 177.779999 177.779999

Volume
Date
2018-07-30 65280800
2018-07-31 40356500
2018-08-01 34042100
2018-08-02 32400000
2018-08-03 24763400

[13]: fb["2019-06"]

[13]: Open High Low Close Adj Close \
Date
2019-06-03 175.000000 175.050003 161.009995 164.149994 164.149994
2019-06-04 163.710007 168.279999 160.839996 167.500000 167.500000
2019-06-05 167.479996 168.720001 164.630005 168.169998 168.169998
2019-06-06 168.300003 169.699997 167.229996 168.330002 168.330002
2019-06-07 170.169998 173.869995 168.839996 173.350006 173.350006
2019-06-10 174.750000 177.860001 173.800003 174.820007 174.820007
2019-06-11 178.479996 179.979996 176.789993 178.100006 178.100006
2019-06-12 178.380005 179.270004 172.880005 175.039993 175.039993
2019-06-13 175.529999 178.029999 174.610001 177.470001 177.470001
2019-06-14 180.509995 181.839996 180.000000 181.330002 181.330002

Volume
Date
2019-06-03 56059600
2019-06-04 46044300
2019-06-05 19758300
2019-06-06 12446400
2019-06-07 16917300
2019-06-10 14767900
2019-06-11 15266600
2019-06-12 17699800
2019-06-13 12253600
2019-06-14 16773700

24.3 Working with Indexes


[14]: fb["2019-06"].Close.mean()

[14]: 181.27450025000002
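The same monthly average can be computed for every month at once with resampling (covered in a later notebook):

    # Month-end mean of the closing price
    fb.Close.resample("M").mean().head()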

[15]: fb["2019-07-05":"2019-07-10"]

[15]: Open High Low Close Adj Close \


Date
2019-07-05 196.179993 197.070007 194.169998 196.399994 196.399994
2019-07-08 195.190002 196.679993 193.639999 195.759995 195.759995
2019-07-09 194.970001 199.460007 194.889999 199.210007 199.210007
2019-07-10 200.000000 202.960007 199.669998 202.729996 202.729996

Volume
Date
2019-07-05 11164100
2019-07-08 9723900
2019-07-09 14698600
2019-07-10 20571700

[16]: t=pd.to_datetime("7/22/2019")
t

[16]: Timestamp('2019-07-22 00:00:00')

[17]: fb.loc[fb.index>=t,:]

[17]: Open High Low Close Adj Close \


Date
2019-07-22 199.910004 202.570007 198.809998 202.320007 202.320007
2019-07-23 202.839996 204.240005 200.960007 202.360001 202.360001
2019-07-24 197.630005 204.809998 197.220001 204.660004 204.660004
2019-07-25 206.699997 208.660004 198.259995 200.710007 200.710007
2019-07-26 200.190002 202.880005 196.250000 199.750000 199.750000
2019-07-29 199.000000 199.590302 197.880005 198.059998 198.059998

Volume
Date
2019-07-22 13589000
2019-07-23 14583700
2019-07-24 32532500
2019-07-25 39889900
2019-07-26 24426700
2019-07-29 754198

24.4 Adding Dates to a Data Set


[18]: fb1=pd.read_csv("FB-no-date.csv",sep=";")

[19]: fb1.head()

[19]: Open High Low Close Adj Close Volume


0 162600006 163130005 161690002 162279999 162279999 11097800
1 163899994 167500000 163830002 167369995 167369995 18894700
2 167369995 171880005 166550003 171259995 171259995 28187900
3 172899994 173570007 171270004 172509995 172509995 21531700
4 171500000 171740005 167610001 169130005 169130005 18306500
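Notice the prices look like they have lost their decimal point (162600006 rather than 162.600006). If FB-no-date.csv uses a European-style decimal comma, a possible fix is the decimal argument of read_csv; this is an assumption about how the file is encoded:

    # Assuming ';' separates fields and ',' is the decimal mark in the file
    fb1=pd.read_csv("FB-no-date.csv", sep=";", decimal=",")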

[20]: dates=pd.date_range(start="03/01/2019",
end="03/29/2019",
freq="B")
dates

[20]: DatetimeIndex(['2019-03-01', '2019-03-04', '2019-03-05', '2019-03-06',


'2019-03-07', '2019-03-08', '2019-03-11', '2019-03-12',
'2019-03-13', '2019-03-14', '2019-03-15', '2019-03-18',
'2019-03-19', '2019-03-20', '2019-03-21', '2019-03-22',

'2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29'],
dtype='datetime64[ns]', freq='B')

[21]: fb1.set_index(dates,inplace=True)

[22]: fb1.head()

[22]: Open High Low Close Adj Close Volume


2019-03-01 162600006 163130005 161690002 162279999 162279999 11097800
2019-03-04 163899994 167500000 163830002 167369995 167369995 18894700
2019-03-05 167369995 171880005 166550003 171259995 171259995 28187900
2019-03-06 172899994 173570007 171270004 172509995 172509995 21531700
2019-03-07 171500000 171740005 167610001 169130005 169130005 18306500

[23]: fb1.index

[23]: DatetimeIndex(['2019-03-01', '2019-03-04', '2019-03-05', '2019-03-06',


'2019-03-07', '2019-03-08', '2019-03-11', '2019-03-12',
'2019-03-13', '2019-03-14', '2019-03-15', '2019-03-18',
'2019-03-19', '2019-03-20', '2019-03-21', '2019-03-22',
'2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29'],
dtype='datetime64[ns]', freq='B')

24.5 Visualizing Time Series Data


[24]: %matplotlib inline

[25]: fb1.Close.plot()

[25]: <AxesSubplot:>

[26]: fb1.asfreq("H",method="pad").head()

[26]: Open High Low Close Adj Close \


2019-03-01 00:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 01:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 02:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 03:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 04:00:00 162600006 163130005 161690002 162279999 162279999

Volume
2019-03-01 00:00:00 11097800
2019-03-01 01:00:00 11097800
2019-03-01 02:00:00 11097800
2019-03-01 03:00:00 11097800
2019-03-01 04:00:00 11097800

[27]: fb1.asfreq("W", method="pad")

[27]: Open High Low Close Adj Close Volume


2019-03-03 162600006 163130005 161690002 162279999 162279999 11097800
2019-03-10 166199997 169619995 165970001 169600006 169600006 13184800
2019-03-17 167160004 167580002 162509995 165979996 165979996 37135400

2019-03-24 165649994 167419998 164089996 164339996 164339996 16389200

[28]: fb1.asfreq("H", method="pad")

[28]: Open High Low Close Adj Close \


2019-03-01 00:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 01:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 02:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 03:00:00 162600006 163130005 161690002 162279999 162279999
2019-03-01 04:00:00 162600006 163130005 161690002 162279999 162279999
… … … … … …
2019-03-28 20:00:00 164570007 166720001 163330002 165550003 165550003
2019-03-28 21:00:00 164570007 166720001 163330002 165550003 165550003
2019-03-28 22:00:00 164570007 166720001 163330002 165550003 165550003
2019-03-28 23:00:00 164570007 166720001 163330002 165550003 165550003
2019-03-29 00:00:00 166389999 167190002 164809998 166690002 166690002

Volume
2019-03-01 00:00:00 11097800
2019-03-01 01:00:00 11097800
2019-03-01 02:00:00 11097800
2019-03-01 03:00:00 11097800
2019-03-01 04:00:00 11097800
… …
2019-03-28 20:00:00 10443000
2019-03-28 21:00:00 10443000
2019-03-28 22:00:00 10443000
2019-03-28 23:00:00 10443000
2019-03-29 00:00:00 13455500

[673 rows x 6 columns]
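method="pad" (an alias for "ffill") carries the last observation forward; "bfill" would pull the next observation backward instead:

    # Backward fill: each new hourly stamp takes the next available values
    fb1.asfreq("H", method="bfill").head()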

[29]: z=pd.date_range(start="3/1/2019",
periods=60 , freq="B")
z

[29]: DatetimeIndex(['2019-03-01', '2019-03-04', '2019-03-05', '2019-03-06',


'2019-03-07', '2019-03-08', '2019-03-11', '2019-03-12',
'2019-03-13', '2019-03-14', '2019-03-15', '2019-03-18',
'2019-03-19', '2019-03-20', '2019-03-21', '2019-03-22',
'2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29', '2019-04-01', '2019-04-02', '2019-04-03',
'2019-04-04', '2019-04-05', '2019-04-08', '2019-04-09',
'2019-04-10', '2019-04-11', '2019-04-12', '2019-04-15',
'2019-04-16', '2019-04-17', '2019-04-18', '2019-04-19',
'2019-04-22', '2019-04-23', '2019-04-24', '2019-04-25',
'2019-04-26', '2019-04-29', '2019-04-30', '2019-05-01',

'2019-05-02', '2019-05-03', '2019-05-06', '2019-05-07',
'2019-05-08', '2019-05-09', '2019-05-10', '2019-05-13',
'2019-05-14', '2019-05-15', '2019-05-16', '2019-05-17',
'2019-05-20', '2019-05-21', '2019-05-22', '2019-05-23'],
dtype='datetime64[ns]', freq='B')

[30]: z=pd.date_range(
start="3/1/2019", periods=30, freq="H")
z

[30]: DatetimeIndex(['2019-03-01 00:00:00', '2019-03-01 01:00:00',


'2019-03-01 02:00:00', '2019-03-01 03:00:00',
'2019-03-01 04:00:00', '2019-03-01 05:00:00',
'2019-03-01 06:00:00', '2019-03-01 07:00:00',
'2019-03-01 08:00:00', '2019-03-01 09:00:00',
'2019-03-01 10:00:00', '2019-03-01 11:00:00',
'2019-03-01 12:00:00', '2019-03-01 13:00:00',
'2019-03-01 14:00:00', '2019-03-01 15:00:00',
'2019-03-01 16:00:00', '2019-03-01 17:00:00',
'2019-03-01 18:00:00', '2019-03-01 19:00:00',
'2019-03-01 20:00:00', '2019-03-01 21:00:00',
'2019-03-01 22:00:00', '2019-03-01 23:00:00',
'2019-03-02 00:00:00', '2019-03-02 01:00:00',
'2019-03-02 02:00:00', '2019-03-02 03:00:00',
'2019-03-02 04:00:00', '2019-03-02 05:00:00'],
dtype='datetime64[ns]', freq='H')

[31]: ts=pd.Series(
np.random.randint(1,10,len(z)),index=z)
ts.head()

[31]: 2019-03-01 00:00:00 7


2019-03-01 01:00:00 1
2019-03-01 02:00:00 6
2019-03-01 03:00:00 2
2019-03-01 04:00:00 7
Freq: H, dtype: int32

from file: 25-Working with Methods in Pandas - Part 2

25 Working with Methods in Pandas - Part 2


[1]: import pandas as pd
import numpy as np

25.1 to_datetime method
[2]: pd.to_datetime("15/08/2019")

C:\Users\lenovo\AppData\Local\Temp\ipykernel_15352\2752892349.py:1: UserWarning:
Parsing '15/08/2019' in DD/MM/YYYY format. Provide format or specify
infer_datetime_format=True for consistent parsing.
pd.to_datetime("15/08/2019")

[2]: Timestamp('2019-08-15 00:00:00')

[3]: date=["2019-01-05","jan 6, 2019",
"7/05/2019","2019/01/9","20190110"]

[4]: pd.to_datetime(date)

[4]: DatetimeIndex(['2019-01-05', '2019-01-06', '2019-07-05', '2019-01-09',


'2019-01-10'],
dtype='datetime64[ns]', freq=None)

[5]: pd.to_datetime("03/05/2019")

[5]: Timestamp('2019-03-05 00:00:00')

[6]: pd.to_datetime("05/03/2019", dayfirst=True)

[6]: Timestamp('2019-03-05 00:00:00')

[7]: pd.to_datetime("05*03*2019",
format="%d*%m*%Y" )

[7]: Timestamp('2019-03-05 00:00:00')

[8]: pd.to_datetime("05$03$2019",
format="%d$%m$%Y")

[8]: Timestamp('2019-03-05 00:00:00')

[9]: date=["2019-01-05",
"jan 6, 2019",
"7/05/2019",
"2019/01/9",
"20190110"]

[10]: pd.to_datetime(date)

[10]: DatetimeIndex(['2019-01-05', '2019-01-06', '2019-07-05', '2019-01-09',


'2019-01-10'],

dtype='datetime64[ns]', freq=None)

[11]: pd.to_datetime(date, errors='coerce')

[11]: DatetimeIndex(['2019-01-05', '2019-01-06', '2019-07-05', '2019-01-09',


'2019-01-10'],
dtype='datetime64[ns]', freq=None)

[12]: t=1000000000

[13]: pd.to_datetime(t, unit="s")

[13]: Timestamp('2001-09-09 01:46:40')
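The unit argument accepts other epoch resolutions as well; for instance, the same instant expressed in milliseconds:

    # 10**12 ms after the epoch == 10**9 s after the epoch
    pd.to_datetime(1000000000000, unit="ms")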

25.2 Frequency and Date Offsets


[14]: pd.date_range(
"2010-01-01","2010-01-03", freq="4h")

[14]: DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 04:00:00',


'2010-01-01 08:00:00', '2010-01-01 12:00:00',
'2010-01-01 16:00:00', '2010-01-01 20:00:00',
'2010-01-02 00:00:00', '2010-01-02 04:00:00',
'2010-01-02 08:00:00', '2010-01-02 12:00:00',
'2010-01-02 16:00:00', '2010-01-02 20:00:00',
'2010-01-03 00:00:00'],
dtype='datetime64[ns]', freq='4H')

[15]: pd.date_range(
"2010-01-01","2010-09-03",
freq="WOM-4SUN")

[15]: DatetimeIndex(['2010-01-24', '2010-02-28', '2010-03-28', '2010-04-25',
'2010-05-23', '2010-06-27', '2010-07-25', '2010-08-22'],
dtype='datetime64[ns]', freq='WOM-4SUN')
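
Frequency strings such as "4h" and "WOM-4SUN" map to date offset objects in pd.offsets, which can also be added to timestamps directly. A short sketch of that equivalence (dates chosen arbitrarily):

pd.Timestamp("2020-01-01") + pd.offsets.BDay(3)
# Timestamp('2020-01-06 00:00:00') -- Jan 1, 2020 is a Wednesday; three business days later is Monday
pd.Timestamp("2020-01-31") + pd.offsets.MonthEnd(1)
# Timestamp('2020-02-29 00:00:00') -- rolls forward to the next month end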

25.3 Period and PeriodIndex


[16]: p=pd.Period(2020)
p

[16]: Period('2020', 'A-DEC')

[17]: dir(p)

[17]: ['__add__',
'__array_priority__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pyx_vtable__',
'__radd__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rsub__',
'__setattr__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__weakref__',
'_add_offset',
'_add_timedeltalike_scalar',
'_dtype',
'_from_ordinal',
'_get_to_timestamp_base',
'_maybe_convert_freq',
'_require_matching_freq',
'asfreq',
'day',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'end_time',
'freq',
'freqstr',
'hour',
'is_leap_year',
'minute',
'month',
'now',
'ordinal',
'quarter',
'qyear',
'second',
'start_time',
'strftime',
'to_timestamp',
'week',
'weekday',
'weekofyear',
'year']

[18]: p.start_time

[18]: Timestamp('2020-01-01 00:00:00')

[19]: p.end_time

[19]: Timestamp('2020-12-31 23:59:59.999999999')

[20]: a=pd.Period("2020-01", freq="M")
a

[20]: Period('2020-01', 'M')

[21]: a+5

[21]: Period('2020-06', 'M')

[22]: a-3

[22]: Period('2019-10', 'M')

[23]: p-pd.Period("2015")

[23]: <5 * YearEnds: month=12>

137
[24]: rng=pd.period_range(
"2019-01-01","2019-08-30",freq="M")
rng

[24]: PeriodIndex(['2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06',
'2019-07', '2019-08'],
dtype='period[M]')

[25]: pd.Series(range(8),index=rng)

[25]: 2019-01 0
2019-02 1
2019-03 2
2019-04 3
2019-05 4
2019-06 5
2019-07 6
2019-08 7
Freq: M, dtype: int64

[26]: p=pd.Period("2019",freq="A-DEC")
p

[26]: Period('2019', 'A-DEC')

[27]: p.asfreq('M', how='start')

[27]: Period('2019-01', 'M')

[28]: p.asfreq("M",how="end")

[28]: Period('2019-12', 'M')

[29]: p=pd.Period("2019Q4", freq="Q-DEC")

[30]: p=pd.Period("2019Q4", freq="Q-FEB")
p

[30]: Period('2019Q4', 'Q-FEB')

[31]: p.end_time

[31]: Timestamp('2019-02-28 23:59:59.999999999')

[32]: p.asfreq('D', 'start')

[32]: Period('2018-12-01', 'D')

[33]: rng=pd.period_range(
"2019Q3","2020Q4", freq="Q-JAN")
rng

[33]: PeriodIndex(['2019Q3', '2019Q4', '2020Q1', '2020Q2', '2020Q3', '2020Q4'],
dtype='period[Q-JAN]')

[34]: ts=pd.Series(range(len(rng)),index=rng)
ts

[34]: 2019Q3 0
2019Q4 1
2020Q1 2
2020Q2 3
2020Q3 4
2020Q4 5
Freq: Q-JAN, dtype: int64

[35]: rng=pd.date_range(
"2020-01-01", periods=5, freq="M")

[36]: ts=pd.Series(range(len(rng)),index=rng)
ts

[36]: 2020-01-31 0
2020-02-29 1
2020-03-31 2
2020-04-30 3
2020-05-31 4
Freq: M, dtype: int64

[37]: pts=ts.to_period()
pts.index

[37]: PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05'],
dtype='period[M]')

[38]: pts.index

[38]: PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05'],
dtype='period[M]')
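
A PeriodIndex can also be converted back to timestamps with to_timestamp; a short sketch continuing from pts above (outputs paraphrased in the comments, not copied from a run):

pts.to_timestamp(how="start").index  # month-start timestamps: 2020-01-01, 2020-02-01, ...
pts.to_timestamp(how="end").index    # the last instant of each month, e.g. 2020-01-31 23:59:59.999999999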

from file: 26-Important Methods for Time Series in Pandas

26 Important Methods for Time Series in Pandas
[1]: import pandas as pd ; import numpy as np

26.1 resampling method


[2]: fb=pd.read_csv("FB.csv",
parse_dates=["Date"],
index_col="Date")

[3]: fb.head()

[3]: Open High Low Close Adj Close \
Date
2018-07-30 175.300003 175.300003 166.559998 171.059998 171.059998
2018-07-31 170.669998 174.240005 170.000000 172.580002 172.580002
2018-08-01 173.929993 175.080002 170.899994 171.649994 171.649994
2018-08-02 170.679993 176.789993 170.270004 176.369995 176.369995
2018-08-03 177.690002 178.850006 176.149994 177.779999 177.779999

Volume
Date
2018-07-30 65280800
2018-07-31 40356500
2018-08-01 34042100
2018-08-02 32400000
2018-08-03 24763400

[4]: fb.resample("M").mean()

[4]: Open High Low Close Adj Close \
Date
2018-07-31 172.985001 174.770004 168.279999 171.820000 171.820000
2018-08-31 177.598695 179.433914 175.680868 177.492172 177.492172
2018-09-30 164.233158 166.399473 162.416843 164.377368 164.377368
2018-10-31 154.873479 157.124784 152.103045 154.187826 154.187826
2018-11-30 141.762857 143.658096 139.593336 141.635715 141.635715
2018-12-31 137.529475 140.493684 134.814209 137.161052 137.161052
2019-01-31 144.551904 147.329999 142.938095 145.422857 145.422857
2019-02-28 164.754735 166.534211 163.293158 164.813684 164.813684
2019-03-31 166.840477 168.799524 165.379525 167.411428 167.411428
2019-04-30 180.315239 182.084285 178.955712 180.544285 180.544285
2019-05-31 185.974545 187.902727 184.217273 186.082273 186.082273
2019-06-30 181.664500 183.892500 178.924001 181.274500 181.274500
2019-07-31 199.628499 201.783017 197.694002 200.097500 200.097500

Volume
Date
2018-07-31 5.281865e+07
2018-08-31 2.386229e+07
2018-09-30 2.634046e+07
2018-10-31 2.706288e+07
2018-11-30 2.467389e+07
2018-12-31 2.940980e+07
2019-01-31 2.512133e+07
2019-02-28 1.590754e+07
2019-03-31 1.847315e+07
2019-04-30 1.818978e+07
2019-05-31 1.303734e+07
2019-06-30 2.132143e+07
2019-07-31 1.543907e+07

[5]: %matplotlib inline

[6]: fb.Close.resample("M").mean().plot()

[6]: <AxesSubplot:xlabel='Date'>

[7]: fb.Close.resample("Q").mean().plot(
kind="bar")

[7]: <AxesSubplot:xlabel='Date'>
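
mean() is only one choice of aggregation; resample() returns a group-like object that accepts any reduction, and agg() can combine several. A sketch using the same fb data:

fb.Volume.resample("W").sum().head()                 # total traded volume per week
fb.Close.resample("Q").agg(["min", "max", "mean"])   # several statistics per quarter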

26.2 Shifting
[8]: fb1=pd.DataFrame(fb.Close["2019-03"])

[9]: fb1.head()

[9]: Close
Date
2019-03-01 162.279999
2019-03-04 167.369995
2019-03-05 171.259995
2019-03-06 172.509995
2019-03-07 169.130005

[10]: fb1.shift(2)

[10]: Close
Date
2019-03-01 NaN
2019-03-04 NaN
2019-03-05 162.279999
2019-03-06 167.369995
2019-03-07 171.259995
2019-03-08 172.509995
2019-03-11 169.130005
2019-03-12 169.600006
2019-03-13 172.070007
2019-03-14 171.919998
2019-03-15 173.369995
2019-03-18 170.169998
2019-03-19 165.979996
2019-03-20 160.470001
2019-03-21 161.570007
2019-03-22 165.440002
2019-03-25 166.080002
2019-03-26 164.339996
2019-03-27 166.289993
2019-03-28 167.679993
2019-03-29 165.869995

[11]: fb1.shift(-2)

[11]: Close
Date
2019-03-01 171.259995
2019-03-04 172.509995
2019-03-05 169.130005
2019-03-06 169.600006
2019-03-07 172.070007
2019-03-08 171.919998
2019-03-11 173.369995
2019-03-12 170.169998
2019-03-13 165.979996
2019-03-14 160.470001
2019-03-15 161.570007
2019-03-18 165.440002
2019-03-19 166.080002
2019-03-20 164.339996
2019-03-21 166.289993
2019-03-22 167.679993
2019-03-25 165.869995
2019-03-26 165.550003
2019-03-27 166.690002
2019-03-28 NaN
2019-03-29 NaN

[12]: fb1["Previous Price"]=fb1.shift(1)

[13]: fb1

[13]: Close Previous Price
Date
2019-03-01 162.279999 NaN
2019-03-04 167.369995 162.279999
2019-03-05 171.259995 167.369995
2019-03-06 172.509995 171.259995
2019-03-07 169.130005 172.509995
2019-03-08 169.600006 169.130005
2019-03-11 172.070007 169.600006
2019-03-12 171.919998 172.070007
2019-03-13 173.369995 171.919998
2019-03-14 170.169998 173.369995
2019-03-15 165.979996 170.169998
2019-03-18 160.470001 165.979996
2019-03-19 161.570007 160.470001
2019-03-20 165.440002 161.570007
2019-03-21 166.080002 165.440002
2019-03-22 164.339996 166.080002
2019-03-25 166.289993 164.339996
2019-03-26 167.679993 166.289993
2019-03-27 165.869995 167.679993
2019-03-28 165.550003 165.869995
2019-03-29 166.690002 165.550003

[14]: fb1["One Day Difference"]=fb1[


"Close"]-fb1["Previous Price"]

[15]: fb1.head()

[15]: Close Previous Price One Day Difference
Date
2019-03-01 162.279999 NaN NaN
2019-03-04 167.369995 162.279999 5.089996
2019-03-05 171.259995 167.369995 3.890000
2019-03-06 172.509995 171.259995 1.250000
2019-03-07 169.130005 172.509995 -3.379990

[16]: fb1["Percentage Change"]=(


fb1["Close"]-fb1[
"Previous Price"])*100/fb1[

144
"Previous Price"]

[17]: fb2=fb1[["Close"]]

[18]: fb2.head()

[18]: Close
Date
2019-03-01 162.279999
2019-03-04 167.369995
2019-03-05 171.259995
2019-03-06 172.509995
2019-03-07 169.130005

[19]: fb2.index

[19]: DatetimeIndex(['2019-03-01', '2019-03-04', '2019-03-05', '2019-03-06',
'2019-03-07', '2019-03-08', '2019-03-11', '2019-03-12',
'2019-03-13', '2019-03-14', '2019-03-15', '2019-03-18',
'2019-03-19', '2019-03-20', '2019-03-21', '2019-03-22',
'2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29'],
dtype='datetime64[ns]', name='Date', freq=None)

[20]: fb2.index=pd.date_range(
"2019-03-01", periods=21, freq="B")

[21]: fb2.index

[21]: DatetimeIndex(['2019-03-01', '2019-03-04', '2019-03-05', '2019-03-06',
'2019-03-07', '2019-03-08', '2019-03-11', '2019-03-12',
'2019-03-13', '2019-03-14', '2019-03-15', '2019-03-18',
'2019-03-19', '2019-03-20', '2019-03-21', '2019-03-22',
'2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29'],
dtype='datetime64[ns]', freq='B')

[22]: fb2.tshift(1)

<ipython-input-22-93d300b30aef>:1: FutureWarning: tshift is deprecated and will
be removed in a future version. Please use shift instead.
fb2.tshift(1)

[22]: Close
2019-03-04 162.279999
2019-03-05 167.369995
2019-03-06 171.259995
2019-03-07 172.509995
2019-03-08 169.130005
2019-03-11 169.600006
2019-03-12 172.070007
2019-03-13 171.919998
2019-03-14 173.369995
2019-03-15 170.169998
2019-03-18 165.979996
2019-03-19 160.470001
2019-03-20 161.570007
2019-03-21 165.440002
2019-03-22 166.080002
2019-03-25 164.339996
2019-03-26 166.289993
2019-03-27 167.679993
2019-03-28 165.869995
2019-03-29 165.550003
2019-04-01 166.690002

[23]: fb2.tshift(-2)

<ipython-input-23-2455d462e22f>:1: FutureWarning: tshift is deprecated and will
be removed in a future version. Please use shift instead.
fb2.tshift(-2)

[23]: Close
2019-02-27 162.279999
2019-02-28 167.369995
2019-03-01 171.259995
2019-03-04 172.509995
2019-03-05 169.130005
2019-03-06 169.600006
2019-03-07 172.070007
2019-03-08 171.919998
2019-03-11 173.369995
2019-03-12 170.169998
2019-03-13 165.979996
2019-03-14 160.470001
2019-03-15 161.570007
2019-03-18 165.440002
2019-03-19 166.080002
2019-03-20 164.339996
2019-03-21 166.289993
2019-03-22 167.679993
2019-03-25 165.869995
2019-03-26 165.550003
2019-03-27 166.690002
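
Because tshift is deprecated, the modern equivalent is shift with the freq argument, which moves the index rather than the values. A sketch:

fb2.shift(1, freq="B")    # same result as fb2.tshift(1)
fb2.shift(-2, freq="B")   # same result as fb2.tshift(-2)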

26.3 Moving Window Functions
[24]: fb.Close.plot()

[24]: <AxesSubplot:xlabel='Date'>

[25]: fb.Close.plot()
fb.Close.rolling(30).mean().plot()

[25]: <AxesSubplot:xlabel='Date'>
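
rolling() returns a window object, so other statistics and window options work the same way. A sketch (window sizes are arbitrary):

fb.Close.rolling(30, min_periods=1).mean().plot()   # fills the initial gap instead of NaN
fb.Close.rolling(30).std().plot()                   # 30-day rolling volatility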

26.4 Time Zone Handling
[26]: import pytz

[27]: pytz.timezone("Turkey")

[27]: <DstTzInfo 'Turkey' LMT+1:56:00 STD>

[28]: pytz.timezone("America/New_York")

[28]: <DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>

[29]: pytz.common_timezones[-7:]

[29]: ['US/Arizona',
'US/Central',
'US/Eastern',
'US/Hawaii',
'US/Mountain',
'US/Pacific',
'UTC']

[30]: x=pd.date_range("12/9/2009 9:30",
periods=6, freq="D")

[31]: ts=pd.Series(np.random.randn(len(x)),
index=x)

[32]: ts

[32]: 2009-12-09 09:30:00 1.321881
2009-12-10 09:30:00 0.528298
2009-12-11 09:30:00 -0.320733
2009-12-12 09:30:00 1.692672
2009-12-13 09:30:00 -0.194015
2009-12-14 09:30:00 0.585790
Freq: D, dtype: float64

[33]: print(ts.index.tz)

None

[34]: ts_utc=ts.tz_localize("UTC")
ts_utc

[34]: 2009-12-09 09:30:00+00:00 1.321881
2009-12-10 09:30:00+00:00 0.528298
2009-12-11 09:30:00+00:00 -0.320733
2009-12-12 09:30:00+00:00 1.692672
2009-12-13 09:30:00+00:00 -0.194015
2009-12-14 09:30:00+00:00 0.585790
Freq: D, dtype: float64

[35]: ts_utc.tz_convert("US/Hawaii")

[35]: 2009-12-08 23:30:00-10:00 1.321881
2009-12-09 23:30:00-10:00 0.528298
2009-12-10 23:30:00-10:00 -0.320733
2009-12-11 23:30:00-10:00 1.692672
2009-12-12 23:30:00-10:00 -0.194015
2009-12-13 23:30:00-10:00 0.585790
Freq: D, dtype: float64

[36]: zstamp=pd.Timestamp("2019-06-26 05:00")
zstamp

[36]: Timestamp('2019-06-26 05:00:00')

[37]: zstamp_utc=zstamp.tz_localize("utc")
zstamp_utc

[37]: Timestamp('2019-06-26 05:00:00+0000', tz='UTC')

[38]: zstamp_utc.tz_convert("Europe/Istanbul")

[38]: Timestamp('2019-06-26 08:00:00+0300', tz='Europe/Istanbul')

[39]: ts

[39]: 2009-12-09 09:30:00 1.321881
2009-12-10 09:30:00 0.528298
2009-12-11 09:30:00 -0.320733
2009-12-12 09:30:00 1.692672
2009-12-13 09:30:00 -0.194015
2009-12-14 09:30:00 0.585790
Freq: D, dtype: float64

[40]: ts1=ts[:5].tz_localize("Europe/Berlin")
ts2=ts[2:].tz_localize("Europe/Istanbul")

[41]: result=ts1+ts2

[42]: result.index

[42]: DatetimeIndex(['2009-12-09 08:30:00+00:00', '2009-12-10 08:30:00+00:00',
'2009-12-11 07:30:00+00:00', '2009-12-11 08:30:00+00:00',
'2009-12-12 07:30:00+00:00', '2009-12-12 08:30:00+00:00',
'2009-12-13 07:30:00+00:00', '2009-12-13 08:30:00+00:00',
'2009-12-14 07:30:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
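
Note that when the Berlin- and Istanbul-localized series are added, pandas aligns them by converting both indexes to UTC, which is why result.index above carries the UTC time zone. A tz-aware index can also be built directly instead of localizing afterwards; a sketch (dates arbitrary):

pd.date_range("2020-03-01 09:30", periods=3, freq="D", tz="Europe/Istanbul")
# DatetimeIndex(['2020-03-01 09:30:00+03:00', ...], dtype='datetime64[ns, Europe/Istanbul]', freq='D')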

from file: 27-Data Visualization with Pandas - Part 1

27 Data Visualization with Pandas - Part 1


[1]: import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

[2]: %matplotlib inline

[3]: plt.style.use("fivethirtyeight")

27.1 Plotting graphs with the plot() method


[4]: data=pd.Series(np.random.randn(1000).cumsum())

[5]: data.plot()

[5]: <AxesSubplot:>

[6]: df1 = pd.DataFrame(np.random.randn(100, 4),columns=list('ABCD'))

[7]: df1 = df1.cumsum()

[8]: df1.plot()

[8]: <AxesSubplot:>
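
plot() forwards most keyword arguments to matplotlib, so the figure can be customized in the same call. A sketch (all argument values arbitrary):

df1.plot(figsize=(10, 5), title="Cumulative sums", grid=True, linewidth=1)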

27.2 Bar Charts
[9]: df1.iloc[10].plot(kind='bar')

[9]: <AxesSubplot:>

[10]: df1.iloc[10].plot.bar()

[10]: <AxesSubplot:>

[11]: df2=pd.DataFrame(np.random.rand(7,3), columns=list("ABC"))

[12]: df2.plot.bar()

[12]: <AxesSubplot:>

[13]: df2.plot.bar(stacked=True)

[13]: <AxesSubplot:>

[14]: df2.plot.barh(stacked=True)

[14]: <AxesSubplot:>

27.3 Histograms
[15]: iris=pd.read_csv("iris.data", header=None)
iris.columns=["sepal_length","sepal_width", "petal_length",
"petal_width", "species"]

You can access this data set here.


[16]: iris.head()

[16]: sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

[17]: iris.plot.hist(alpha=0.7)

[17]: <AxesSubplot:ylabel='Frequency'>

[18]: iris.plot.hist(alpha=1, stacked=True)

[18]: <AxesSubplot:ylabel='Frequency'>

[19]: bins=25

[20]: iris.plot.hist(alpha=1, stacked=True, bins=25)

[20]: <AxesSubplot:ylabel='Frequency'>

[21]: iris["sepal_width"].plot.hist(orientation="horizontal")

[21]: <AxesSubplot:xlabel='Frequency'>

[22]: iris["sepal_length"].diff().hist()

[22]: <AxesSubplot:>

[23]: iris.hist(color="blue", alpha=1, bins=20)

[23]: array([[<AxesSubplot:title={'center':'sepal_length'}>,
<AxesSubplot:title={'center':'sepal_width'}>],
[<AxesSubplot:title={'center':'petal_length'}>,
<AxesSubplot:title={'center':'petal_width'}>]], dtype=object)

[24]: iris.hist("petal_length",by="species")

[24]: array([[<AxesSubplot:title={'center':'Iris-setosa'}>,
<AxesSubplot:title={'center':'Iris-versicolor'}>],
[<AxesSubplot:title={'center':'Iris-virginica'}>, <AxesSubplot:>]],
dtype=object)

27.4 Boxplot charts
[25]: iris.plot.box()

[25]: <AxesSubplot:>

[26]: colors={'boxes': 'Red', 'whiskers': 'blue','medians': 'Black', 'caps': 'Green'}

[27]: iris.plot.box(color=colors)

[27]: <AxesSubplot:>

[28]: iris.plot.box(vert=False)

[28]: <AxesSubplot:>

[29]: iris.boxplot()

[29]: <AxesSubplot:>

[30]: plt.rcParams["figure.figsize"]=(8,8)
plt.style.use("ggplot")
iris.boxplot(by='species')

[30]: array([[<AxesSubplot:title={'center':'petal_length'}, xlabel='[species]'>,
<AxesSubplot:title={'center':'petal_width'}, xlabel='[species]'>],
[<AxesSubplot:title={'center':'sepal_length'}, xlabel='[species]'>,
<AxesSubplot:title={'center':'sepal_width'}, xlabel='[species]'>]],
dtype=object)

from file: 28-Data Visualization with Pandas - Part 2

28 Data Visualization with Pandas - Part 2


[1]: import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

[2]: plt.style.use("fivethirtyeight")

28.1 Area Charts
[3]: df = pd.DataFrame(np.random.rand(10, 4), columns=list("ABCD"))
df.head()

[3]: A B C D
0 0.304421 0.862368 0.924565 0.216823
1 0.278892 0.909381 0.955020 0.877742
2 0.537353 0.113712 0.967656 0.526703
3 0.588321 0.331512 0.036084 0.299474
4 0.253378 0.175976 0.610472 0.326040

[4]: df["A"].plot.area()

[4]: <AxesSubplot:>

[5]: df.plot.area()

[5]: <AxesSubplot:>

[6]: df.plot.area(stacked=False)

[6]: <AxesSubplot:>

[7]: iris=pd.read_csv("iris.data", header=None)

[8]: iris.columns=["sepal_length","sepal_width", "petal_length",
"petal_width", "species"]

[9]: iris.dtypes

[9]: sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object

[10]: iris.plot.area()

[10]: <AxesSubplot:>

[11]: iris.plot.area(stacked=False)

[11]: <AxesSubplot:>

28.2 Scatter Plots
[12]: df.plot.scatter(x='A', y='B')

[12]: <AxesSubplot:xlabel='A', ylabel='B'>

[13]: movies=pd.read_csv("imdbratings.txt")

[14]: movies.head()

[14]: star_rating title content_rating genre duration \
0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200
3 9.0 The Dark Knight PG-13 Action 152
4 8.9 Pulp Fiction R Crime 154

actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…

[15]: movies.dtypes

[15]: star_rating float64
title object
content_rating object
genre object
duration int64
actors_list object
dtype: object

[16]: movies.plot.scatter(x='star_rating', y='duration')

[16]: <AxesSubplot:xlabel='star_rating', ylabel='duration'>

[17]: ax=iris.plot.scatter(x='sepal_length', y='sepal_width',
color='Blue', label='sepal')
iris.plot.scatter(x='petal_length', y='petal_width', color='red',
label='petal', ax=ax)

[17]: <AxesSubplot:xlabel='petal_length', ylabel='petal_width'>

[18]: iris.plot.scatter(x='sepal_length', y='sepal_width',
c='petal_length', s=100)

[18]: <AxesSubplot:xlabel='sepal_length', ylabel='sepal_width'>
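
When c is given a column name, pandas adds a colorbar automatically; the colormap can be chosen with cmap. A sketch:

iris.plot.scatter(x='sepal_length', y='sepal_width',
c='petal_length', cmap='viridis', s=60)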

[19]: iris.plot.scatter(x='sepal_length', y='sepal_width',
s=iris['petal_length'] * 50)

[19]: <AxesSubplot:xlabel='sepal_length', ylabel='sepal_width'>

28.3 Hexagonal bin charts
[20]: movies.plot.hexbin(x="star_rating", y="duration", gridsize=25)

[20]: <AxesSubplot:xlabel='star_rating', ylabel='duration'>

[21]: movies.plot.hexbin(x="star_rating", y="duration", gridsize=10)

[21]: <AxesSubplot:xlabel='star_rating', ylabel='duration'>

28.4 Pie Charts
[22]: iris_avg=iris["petal_width"].groupby(iris["species"]).mean()
iris_avg

[22]: species
Iris-setosa 0.244
Iris-versicolor 1.326
Iris-virginica 2.026
Name: petal_width, dtype: float64

[23]: iris_avg.plot.pie()

[23]: <AxesSubplot:ylabel='petal_width'>

[24]: iris_avg_2=iris[["petal_width",
"petal_length"]].groupby(iris["species"]).mean()

[25]: iris_avg_2.plot.pie(subplots=True)

[25]: array([<AxesSubplot:ylabel='petal_width'>,
<AxesSubplot:ylabel='petal_length'>], dtype=object)

[26]: iris_avg.plot.pie()

[26]: <AxesSubplot:ylabel='petal_width'>

[27]: iris_avg.plot.pie(labels=["setosa","versicolor", "virginica"],
colors=list("brg"), fontsize=25, figsize=(10,10))

[27]: <AxesSubplot:ylabel='petal_width'>

[28]: iris_avg.plot.pie(labels=["setosa","versicolor", "virginica"],
colors=list("brg"), autopct='%.2f', fontsize=25, figsize=(10,10))

[28]: <AxesSubplot:ylabel='petal_width'>

28.5 Density chart
[29]: iris.plot.kde()

[29]: <AxesSubplot:ylabel='Density'>

28.6 Scatter matrix
[30]: from pandas.plotting import scatter_matrix

[31]: scatter_matrix(movies, alpha=0.5, diagonal='kde')

[31]: array([[<AxesSubplot:xlabel='star_rating', ylabel='star_rating'>,
<AxesSubplot:xlabel='duration', ylabel='star_rating'>],
[<AxesSubplot:xlabel='star_rating', ylabel='duration'>,
<AxesSubplot:xlabel='duration', ylabel='duration'>]], dtype=object)
