pandas
1 What is Pandas?
Pandas is one of the most important Python libraries for data analysis. Its two core data structures
are Series and DataFrame. A Series is one-dimensional: it consists of a single column of values. A
DataFrame is two-dimensional: it consists of rows and columns.
To install Pandas, you can use "pip install pandas".
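For instance, a minimal sketch of the two structures (with pandas imported as pd, as in the cell below; the names and values here are only illustrative):

s = pd.Series([10, 20, 30])                  # one-dimensional: a single column of values
df = pd.DataFrame({"name": ["Tim", "Kate"],  # two-dimensional: rows and columns
                   "score": [90, 80]})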
[1]: import pandas as pd # Let's import pandas with pd
[2]: pd.__version__
[2]: '1.1.3'
[2]: obj=pd.Series([1,"John",3.5,"Hey"])
obj
[2]: 0 1
1 John
2 3.5
3 Hey
dtype: object
[3]: obj[0]
[3]: 1
[4]: obj.values
[4]: array([1, 'John', 3.5, 'Hey'], dtype=object)
[5]: obj2=pd.Series([1,"John",3.5,"Hey"],index=["a","b","c","d"])
obj2
[5]: a 1
b John
c 3.5
d Hey
dtype: object
[6]: obj2["b"]
[6]: 'John'
[7]: obj2.index
[8]: names=pd.Series([90,80,85,75,95],
index=["Jane","Bill","Elon","Tom","Tim"])
names
[8]: Jane 90
Bill 80
Elon 85
Tom 75
Tim 95
dtype: int64
[9]: names["Tim"]
[9]: 95
[10]: names[names>=85]
[10]: Jane 90
Elon 85
Tim 95
dtype: int64
[11]: names["Tom"]=60
names
[11]: Jane 90
Bill 80
Elon 85
Tom 60
Tim 95
dtype: int64
[12]: names[names<=80]=83
names
[12]: Jane 90
Bill 83
Elon 85
Tom 83
Tim 95
dtype: int64
[13]: True
[14]: False
[15]: names/10
[16]: names**2
[17]: names.isnull()
[17]: Jane False
Bill False
Elon False
Tom False
Tim False
dtype: bool
[1]: import pandas as pd
[2]: games=pd.read_csv("Data/vgsalesGlobale.csv")
[3]: games.head()
[4]: games.dtypes
[5]: games.Genre.describe()
[5]: count 16598
unique 12
top Action
freq 3316
Name: Genre, dtype: object
[6]: games.Genre.value_counts()
[7]: games.Genre.value_counts(normalize=True)
[8]: type(games.Genre.value_counts())
[8]: pandas.core.series.Series
[9]: games.Genre.value_counts().head()
[10]: games.Genre.unique()
[11]: games.Genre.nunique()
[11]: 12
[12]: pd.crosstab(games.Genre,games.Year)
[12]: Year 1980.0 1981.0 1982.0 1983.0 1984.0 1985.0 1986.0 1987.0 \
Genre
Action 1 25 18 7 1 2 6 2
Adventure 0 0 0 1 0 0 0 1
Fighting 1 0 0 0 0 1 0 2
Misc 4 0 1 1 1 0 0 0
Platform 0 3 5 5 1 4 6 2
Puzzle 0 2 3 1 3 4 0 0
Racing 0 1 2 0 3 0 1 0
Role-Playing 0 0 0 0 0 0 1 3
Shooter 2 10 5 1 3 1 4 2
Simulation 0 1 0 0 0 1 0 0
Sports 1 4 2 1 2 1 3 4
Strategy 0 0 0 0 0 0 0 0
… (the columns for the intervening years were lost in extraction) …
Misc 41 39 18 0 0
Platform 10 14 10 0 0
Puzzle 8 6 0 0 0
Racing 27 19 20 0 0
Role-Playing 91 78 40 2 0
Shooter 47 34 32 0 0
Simulation 11 15 9 0 1
Sports 55 62 38 0 0
Strategy 8 17 10 0 0
[13]: games.Global_Sales.describe()
[14]: games.Global_Sales.mean()
[14]: 0.5374406555006628
[15]: games.Global_Sales.value_counts()
[16]: games.Year.plot(kind="hist")
[16]: <AxesSubplot:ylabel='Frequency'>
[17]: games.Genre.value_counts()
[18]: games.Genre.value_counts().plot(kind="bar")
[18]: <AxesSubplot:>
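Note that pandas plotting delegates to matplotlib; inside Jupyter the figure renders automatically, but in a plain script you would need to import matplotlib and call show() yourself, roughly like this (a sketch, assuming games is loaded as above):

import matplotlib.pyplot as plt

games.Genre.value_counts().plot(kind="bar")  # pandas draws on the active matplotlib figure
plt.show()                                   # required outside Jupyter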
4 DataFrame
4.1 What is DataFrame?
[1]: import pandas as pd
[2]: data={"name":["Bill","Tom","Tim","John","Alex","Vanessa","Kate"],
"score":[90,80,85,75,95,60,65],
"sport":["Wrestling","Football","Skiing","Swimming","Tennis",
"Karete","Surfing"],
"sex":["M","M","M","M","F","F","F"]}
[3]: df=pd.DataFrame(data)
[4]: df
[4]: name score sport sex
0 Bill 90 Wrestling M
1 Tom 80 Football M
2 Tim 85 Skiing M
3 John 75 Swimming M
4 Alex 95 Tennis F
5 Vanessa 60 Karate F
6 Kate 65 Surfing F
[5]: df=pd.DataFrame(data,columns=["name","sport","sex","score"])
df
[6]: df.head()
[7]: df.tail()
[8]: df.tail(3)
[9]: df.head(2)
[10]: df=pd.DataFrame(data,columns=["name", "sport", "gender", "score", "age"])
df
[12]: df["sport"]
[13]: my_columns=["name","sport"]
df[my_columns]
[14]: df.sport
[15]: df.loc[["one"]]
[16]: df.loc[["one","two"]]
[17]: df["age"]=18
[19]: df["pass"]=df.score>=70
df
five Alex Tennis NaN 95 17 True
six Vanessa Karate NaN 60 17 False
seven Kate Surfing NaN 65 18 False
[21]: scores={"Math":{"A":85,"B":90,"C":95},
"Physics":{"A":90,"B":80,"C":75}}
[22]: scores_df=pd.DataFrame(scores)
scores_df
[23]: scores_df.T
[23]: A B C
Math 85 90 95
Physics 90 80 75
[24]: scores_df.index.name="name"
scores_df.columns.name="lesson"
[25]: scores_df
[26]: scores_df.values
[26]: array([[85, 90],
[90, 80],
[95, 75]], dtype=int64)
[27]: scores_index=scores_df.index
[28]: scores_index[1]="Jack"
scores_index
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-e9a61a166ed1> in <module>
----> 1 scores_index[1]="Jack"
2 scores_index
~\anaconda3\envs\tensorflow\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
4079
4080 def __setitem__(self, key, value):
-> 4081 raise TypeError("Index does not support mutable operations")
4082
4083 def __getitem__(self, key):
[1]: import pandas as pd
import numpy as np
[2]: obj=pd.Series(np.arange(5),
index=["a","b","c","d","e"])
[3]: obj
[3]: a 0
b 1
c 2
d 3
e 4
dtype: int32
[4]: obj["c"]
[4]: 2
[5]: obj[2]
[5]: 2
[6]: obj[0:3]
[6]: a 0
b 1
c 2
dtype: int32
[7]: obj[["a","c"]]
[7]: a 0
c 2
dtype: int32
[8]: obj[[0,2]]
[8]: a 0
c 2
dtype: int32
[9]: obj[obj<2]
[9]: a 0
b 1
dtype: int32
[10]: obj["a":"c"]
[10]: a 0
b 1
c 2
dtype: int32
[11]: obj["b":"c"]=5
obj
[11]: a 0
b 5
c 5
d 3
e 4
dtype: int32
5.1 DataFrame Indexing
[12]: data=pd.DataFrame(
np.arange(16).reshape(4,4),
index=["London","Paris",
"Berlin","Istanbul"],
columns=["one","two","three","four"])
data
[13]: data["two"]
[13]: London 1
Paris 5
Berlin 9
Istanbul 13
Name: two, dtype: int32
[14]: data[["one","two"]]
[15]: data[:3]
[16]: data[data["four"]>5]
[17]: data[data<5]=0
data
[17]: one two three four
London 0 0 0 0
Paris 0 5 6 7
Berlin 8 9 10 11
Istanbul 12 13 14 15
[18]: data.iloc[1]
[18]: one 0
two 5
three 6
four 7
Name: Paris, dtype: int32
[19]: data.iloc[1,[1,2,3]]
[19]: two 5
three 6
four 7
Name: Paris, dtype: int32
[20]: data.iloc[[1,3],[1,2,3]]
[21]: data.loc["Paris",["one","two"]]
[21]: one 0
two 5
Name: Paris, dtype: int32
[22]: data.loc[:"Paris","four"]
[22]: London 0
Paris 7
Name: four, dtype: int32
[23]: toy_data=pd.Series(np.arange(5),
index=["a","b","c",
"d","e"])
toy_data
[23]: a 0
b 1
c 2
d 3
e 4
dtype: int32
[24]: toy_data[-1]
[24]: 4
[1]: import pandas as pd
import numpy as np
[2]: s=pd.Series([1,2,3,4],
index=["a","b","c","d"])
s
[2]: a 1
b 2
c 3
d 4
dtype: int64
[3]: s["a"]
[3]: 1
[4]: s2=s.reindex(["b","d","a","c","e"])
s2
[4]: b 2.0
d 4.0
a 1.0
c 3.0
e NaN
dtype: float64
[5]: s3=pd.Series(["blue","yellow","purple"],
index=[0,2,4])
s3
[5]: 0 blue
2 yellow
4 purple
dtype: object
[6]: s3.reindex(range(6),method="ffill")
[6]: 0 blue
1 blue
2 yellow
3 yellow
4 purple
5 purple
dtype: object
[7]: df=pd.DataFrame(np.arange(9).reshape(3,3),
index=["a","c","d"],
columns=["Tim","Tom","Kate"])
df
[8]: df2=df.reindex(["d","c","b","a"])
df2
[9]: names=["Kate","Tim","Tom"]
df.reindex(columns=names)
[10]: df.loc[["c","d","a"]]
19
a 0 1 2
[11]: s=pd.Series(np.arange(5.),
index=["a","b","c","d","e"])
s
[11]: a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
[12]: new_s=s.drop("b")
new_s
[12]: a 0.0
c 2.0
d 3.0
e 4.0
dtype: float64
[13]: s.drop(["c","d"])
[13]: a 0.0
b 1.0
e 4.0
dtype: float64
[14]: data=pd.DataFrame(np.arange(16).reshape(4,4),
index=["Kate","Tim",
"Tom","Alex"],
columns=list("ABCD"))
data
[14]: A B C D
Kate 0 1 2 3
Tim 4 5 6 7
Tom 8 9 10 11
Alex 12 13 14 15
[15]: data.drop(["Kate","Tim"])
[15]: A B C D
Tom 8 9 10 11
Alex 12 13 14 15
[16]: data.drop("A",axis=1)
[16]: B C D
Kate 1 2 3
Tim 5 6 7
Tom 9 10 11
Alex 13 14 15
[17]: data.drop("Kate",axis=0)
[17]: A B C D
Tim 4 5 6 7
Tom 8 9 10 11
Alex 12 13 14 15
[18]: data
[18]: A B C D
Kate 0 1 2 3
Tim 4 5 6 7
Tom 8 9 10 11
Alex 12 13 14 15
[19]: data.mean(axis="index")
[19]: A 6.0
B 7.0
C 8.0
D 9.0
dtype: float64
[20]: data.mean(axis="columns")
[1]: import pandas as pd
import numpy as np
[2]: s1=pd.Series(np.arange(4),
index=["a","c","d","e"])
s2=pd.Series(np.arange(5),
index=["a","c","e","f","g"])
[3]: print(s1)
print(s2)
a 0
c 1
d 2
e 3
dtype: int32
a 0
c 1
e 2
f 3
g 4
dtype: int32
[4]: s1+s2
[4]: a 0.0
c 2.0
d NaN
e 5.0
f NaN
g NaN
dtype: float64
[5]: df1=pd.DataFrame(
np.arange(6).reshape(2,3),
columns=list("ABC"),
index=["Tim","Tom"])
df2=pd.DataFrame(
np.arange(9).reshape(3,3),
columns=list("ACD"),
index=["Tim","Kate","Tom"])
[6]: print(df1)
print(df2)
A B C
Tim 0 1 2
Tom 3 4 5
A C D
Tim 0 1 2
Kate 3 4 5
Tom 6 7 8
[7]: df1+df2
[7]: A B C D
Kate NaN NaN NaN NaN
Tim 0.0 NaN 3.0 NaN
Tom 9.0 NaN 12.0 NaN
[8]: df1.add(df2,fill_value=0)
[8]: A B C D
Kate 3.0 NaN 4.0 5.0
Tim 0.0 1.0 3.0 2.0
Tom 9.0 4.0 12.0 8.0
[9]: 1/df1
[9]: A B C
Tim inf 1.00 0.5
Tom 0.333333 0.25 0.2
[10]: df1*3
[10]: A B C
Tim 0 3 6
Tom 9 12 15
[11]: df1.mul(3)
[11]: A B C
Tim 0 3 6
Tom 9 12 15
[12]: df2
[12]: A C D
Tim 0 1 2
Kate 3 4 5
Tom 6 7 8
[13]: s=df2.iloc[1]
s
[13]: A 3
C 4
D 5
Name: Kate, dtype: int32
[14]: df2-s
[14]: A C D
Tim -3 -3 -3
Kate 0 0 0
Tom 3 3 3
[15]: s2=df2["A"]
s2
[15]: Tim 0
Kate 3
Tom 6
Name: A, dtype: int32
[16]: df2.sub(s2,axis="index")
[16]: A C D
Tim 0 1 2
Kate 0 1 2
Tom 0 1 2
[17]: df=pd.DataFrame(np.random.randn(4,3),
columns=list("ABC"),
index=["Kim","Susan","Tim","Tom"])
df
[17]: A B C
Kim -0.173717 -1.126917 -0.595042
Susan -0.641672 -0.073913 -1.828588
Tim -0.389124 -1.786140 0.553646
Tom -0.062436 -0.251933 0.872391
[18]: np.abs(df)
[18]: A B C
Kim 0.173717 1.126917 0.595042
Susan 0.641672 0.073913 1.828588
Tim 0.389124 1.786140 0.553646
Tom 0.062436 0.251933 0.872391
[19]: f=lambda x:x.max()-x.min()
[20]: df.apply(f)
[20]: A 0.579236
B 1.712227
C 2.700979
dtype: float64
[21]: df.apply(f,axis=1)
[22]: f=lambda x:x**2
[23]: df.apply(f)
[23]: A B C
Kim 0.030178 1.269942 0.354075
Susan 0.411743 0.005463 3.343735
Tim 0.151417 3.190294 0.306524
Tom 0.003898 0.063470 0.761066
[1]: import pandas as pd
import numpy as np
[2]: s=pd.Series(range(5),
index=["e","d","a","b","c"])
s
[2]: e 0
d 1
a 2
b 3
c 4
dtype: int64
[3]: s.sort_index()
[3]: a 2
b 3
c 4
d 1
e 0
dtype: int64
[4]: df=pd.DataFrame(
np.arange(12).reshape(3,4),
index=["two","one","three"],
columns=["d","a","b","c"])
df
[4]: d a b c
two 0 1 2 3
one 4 5 6 7
three 8 9 10 11
[5]: df.sort_index()
[5]: d a b c
one 4 5 6 7
three 8 9 10 11
two 0 1 2 3
[6]: df.sort_index(axis=1)
[6]: a b c d
two 1 2 3 0
one 5 6 7 4
three 9 10 11 8
[7]: df.sort_index(axis=1,ascending=False)
[7]: d c b a
two 0 3 2 1
one 4 7 6 5
three 8 11 10 9
[8]: s2=pd.Series([5,np.nan,3,-1,9])
s2
[8]: 0 5.0
1 NaN
2 3.0
3 -1.0
4 9.0
dtype: float64
[9]: s2.sort_values()
[9]: 3 -1.0
2 3.0
0 5.0
4 9.0
1 NaN
dtype: float64
[10]: df2=pd.DataFrame(
{"a":[5,3,-1,9],"b":[1,-2,0,5]})
df2
[10]: a b
0 5 1
1 3 -2
2 -1 0
3 9 5
[11]: df2.sort_values(by="b")
[11]: a b
1 3 -2
2 -1 0
0 5 1
3 9 5
[12]: df2.sort_values(by=["b","a"])
[12]: a b
1 3 -2
2 -1 0
0 5 1
3 9 5
8.1 Practice
Let's practice using a real data set. You can download the data set from
https://www.kaggle.com/melodyxyz/global.
[13]: data=pd.read_csv("Data/vgsalesGlobale.csv")
[14]: data.head()
[15]: data["Name"].sort_values(ascending=False)
[16]: data.sort_values("Name")
Genre Publisher NA_Sales EU_Sales JP_Sales \
4754 Sports Magical Company 0.15 0.10 0.12
8357 Role-Playing Namco Bandai Games 0.00 0.00 0.17
7107 Role-Playing Namco Bandai Games 0.11 0.09 0.00
8602 Role-Playing Namco Bandai Games 0.00 0.00 0.16
8304 Role-Playing Namco Bandai Games 0.00 0.00 0.17
… … … … … …
627 Misc THQ 1.67 0.58 0.00
7835 Misc THQ 0.08 0.09 0.00
15523 Misc THQ 0.01 0.01 0.00
470 Fighting NaN 1.57 1.02 0.00
9135 Platform 505 Games 0.00 0.00 0.14
Other_Sales Global_Sales
4754 0.03 0.41
8357 0.00 0.17
7107 0.03 0.23
8602 0.00 0.16
8304 0.00 0.17
… … …
627 0.20 2.46
7835 0.02 0.19
15523 0.00 0.02
470 0.41 3.00
9135 0.00 0.14
[17]: data.sort_values("Year")
5366 Activision 0.32 0.02 0.0 0.00
1969 Atari 0.99 0.05 0.0 0.01
1766 Activision 1.07 0.07 0.0 0.01
… … … … … …
16307 Unknown 0.01 0.00 0.0 0.00
16327 Namco Bandai Games 0.01 0.00 0.0 0.00
16366 Unknown 0.01 0.00 0.0 0.00
16427 Unknown 0.01 0.00 0.0 0.00
16493 Unknown 0.00 0.01 0.0 0.00
Global_Sales
6896 0.24
2669 0.77
5366 0.34
1969 1.05
1766 1.15
… …
16307 0.01
16327 0.01
16366 0.01
16427 0.01
16493 0.01
[18]: data.sort_values(["Year","Name"])
7351 Fighting NaN 0.10 0.08 0.00
15476 Racing Unknown 0.00 0.00 0.02
11409 Action Nintendo 0.00 0.00 0.08
8899 Misc Empire Interactive 0.07 0.06 0.00
470 Fighting NaN 1.57 1.02 0.00
Other_Sales Global_Sales
258 0.05 4.31
2669 0.01 0.77
6317 0.00 0.27
6896 0.00 0.24
1969 0.01 1.05
… … …
7351 0.03 0.21
15476 0.00 0.02
11409 0.00 0.08
8899 0.02 0.15
470 0.41 3.00
[1]: import pandas as pd
import numpy as np
[2]: df=pd.DataFrame(
[[2.4,np.nan],[6.3,-5.4],
[np.nan,np.nan],[0.75,-1.3]],
index=["a","b","c","d"],
columns=["one","two"])
df
[3]: df.sum()
[3]: one 9.45
two -6.70
dtype: float64
[4]: df.sum(axis=1)
[4]: a 2.40
b 0.90
c 0.00
d -0.55
dtype: float64
[5]: df.mean(axis=1)
[5]: a 2.400
b 0.450
c NaN
d -0.275
dtype: float64
[6]: df.mean(axis=1,skipna=False)
[6]: a NaN
b 0.450
c NaN
d -0.275
dtype: float64
[7]: df.idxmax()
[7]: one b
two d
dtype: object
[8]: df.idxmin()
[8]: one d
two b
dtype: object
[9]: df.cumsum()
[10]: df.describe()
[10]: one two
count 3.000 2.000000
mean 3.150 -3.350000
std 2.850 2.899138
min 0.750 -5.400000
25% 1.575 -4.375000
50% 2.400 -3.350000
75% 4.350 -2.325000
max 6.300 -1.300000
To find the correlation coefficient, let's first import the famous iris data set. You can download the
iris data set from https://archive.ics.uci.edu/ml/datasets/iris.
[11]: iris=pd.read_csv("Data/iris.data",
sep=",",
header=None)
[12]: iris.head()
[12]: 0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
[13]: iris.columns=['sepal_length','sepal_width',
'petal_length','petal_width',
'class']
[14]: iris.head()
[15]: iris["sepal_length"].corr(iris["sepal_width"])
[15]: -0.10936924995064938
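As a sanity check, the same Pearson correlation can be computed directly from its definition (a sketch, assuming iris is loaded as above):

import numpy as np

x = iris["sepal_length"]
y = iris["sepal_width"]
xd = x - x.mean()                 # deviations from the mean
yd = y - y.mean()
r = (xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum())
print(r)                          # ≈ -0.1094, matching corr() above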
[16]: iris.corr()
[16]: sepal_length sepal_width petal_length petal_width
sepal_length 1.000000 -0.109369 0.871754 0.817954
sepal_width -0.109369 1.000000 -0.420516 -0.356544
petal_length 0.871754 -0.420516 1.000000 0.962757
petal_width 0.817954 -0.356544 0.962757 1.000000
[17]: iris.cov()
[18]: iris.corrwith(iris.petal_length)
[19]: s=pd.Series(["b","b","b","b","c",
"c","a","a","a"])
s
[19]: 0 b
1 b
2 b
3 b
4 c
5 c
6 a
7 a
8 a
dtype: object
[20]: s.unique()
[21]: s.value_counts()
[21]: b 4
a 3
c 2
dtype: int64
[22]: x=s.isin(["b","c"])
x
[22]: 0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool
[23]: s[x]
[23]: 0 b
1 b
2 b
3 b
4 c
5 c
dtype: object
[2]: df=pd.read_table("Data/data.txt")
[3]: df2=pd.read_table("Data/data2.txt")
df2
[3]: Tom,80,M
0 Tim,85,M
1 Kim,70,M
2 Kate,90,F
3 Alex,75,F
[4]: pd.read_table("Data/data2.txt",sep=",")
[4]: Tom 80 M
0 Tim 85 M
1 Kim 70 M
2 Kate 90 F
3 Alex 75 F
[5]: pd.read_csv("Data/data2.txt")
[5]: Tom 80 M
0 Tim 85 M
1 Kim 70 M
2 Kate 90 F
3 Alex 75 F
[6]: df=pd.read_table("Data/data2.txt",
sep=",",
header=None)
df
[6]: 0 1 2
0 Tom 80 M
1 Tim 85 M
2 Kim 70 M
3 Kate 90 F
4 Alex 75 F
[7]: df=pd.read_table("Data/data2.txt",
sep=",",
header=None,
names=["name","score","sex"])
df
[8]: df=pd.read_table("Data/data2.txt",
sep=",",header=None,
names=["name","score","sex"],
index_col="name")
df
[8]: score sex
name
Tom 80 M
Tim 85 M
Kim 70 M
Kate 90 F
Alex 75 F
[9]: df2=pd.read_table("Data/data3.txt",
sep=",")
df2
[10]: df2=pd.read_table("Data/data3.txt",
sep=",",
index_col=["lesson","name"])
df2
[11]: df3=pd.read_table("Data/data4.txt",sep=",")
df3
[11]: #hello
name score sex
#scores of students NaN NaN
Tim 80 M
Kate 85 F
Alex 70 M
Tom 90 M
Kim 75 F
[12]: df3=pd.read_table("Data/data4.txt",
sep=",",
skiprows=[0,2])
df3
[13]: df3=pd.read_table("Data/data4.txt",
sep=",",
skiprows=[0,2],
usecols=[0,1])
df3
[14]: df3=pd.read_table("Data/data4.txt",
sep=",",
skiprows=[0,2],
usecols=[0,1],
nrows=3)
df3
[16]: df.to_csv("Data/new_data.csv")
[2]: s=pd.Series(["Sam",np.nan,"Tim","Kim"])
s
[2]: 0 Sam
1 NaN
2 Tim
3 Kim
dtype: object
[3]: s.isnull()
[3]: 0 False
1 True
2 False
3 False
dtype: bool
[4]: s.notnull()
[4]: 0 True
1 False
2 True
3 True
dtype: bool
[5]: s[3]=None
s.isnull()
[5]: 0 False
1 True
2 False
3 True
dtype: bool
[6]: s.dropna()
[6]: 0 Sam
2 Tim
dtype: object
[7]: from numpy import nan as NA
[8]: df=pd.DataFrame([[1,2,3],[4,NA,5],
[NA,NA,NA]])
df
[8]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[9]: df.dropna()
[9]: 0 1 2
0 1.0 2.0 3.0
[10]: df.dropna(how="all")
[10]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
[11]: df
[11]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[12]: df[1]=NA
df
[12]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[13]: df.dropna(axis=1,how="all")
[13]: 0 2
0 1.0 3.0
1 4.0 5.0
2 NaN NaN
[14]: df
[14]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[15]: df.dropna(thresh=3)
df
[15]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[16]: df.fillna(0)
[16]: 0 1 2
0 1.0 0.0 3.0
1 4.0 0.0 5.0
2 0.0 0.0 0.0
[17]: df.fillna({0:15,1:25,2:35})
[17]: 0 1 2
0 1.0 25.0 3.0
1 4.0 25.0 5.0
2 15.0 25.0 35.0
[18]: df
[18]: 0 1 2
0 1.0 NaN 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[19]: df.fillna(0,inplace=True)
df
[19]: 0 1 2
0 1.0 0.0 3.0
1 4.0 0.0 5.0
2 0.0 0.0 0.0
[20]: df=pd.DataFrame([[1,2,3],[4,NA,5],
[NA,NA,NA]])
df
[20]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[21]: df.fillna(method="ffill")
[21]: 0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 5.0
2 4.0 2.0 5.0
[22]: df.fillna(method="ffill",limit=1)
[22]: 0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 5.0
2 4.0 NaN 5.0
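In recent pandas versions (2.1+) the method= argument of fillna is deprecated; the same forward fill is spelled with the dedicated methods (a sketch, not from the original notebook):

df.ffill()         # same as df.fillna(method="ffill")
df.ffill(limit=1)  # same as df.fillna(method="ffill", limit=1)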
[23]: data=pd.Series([1,0,NA,5])
data
[23]: 0 1.0
1 0.0
2 NaN
3 5.0
dtype: float64
[24]: data.fillna(data.mean())
[24]: 0 1.0
1 0.0
2 2.0
3 5.0
dtype: float64
[25]: df
[25]: 0 1 2
0 1.0 2.0 3.0
1 4.0 NaN 5.0
2 NaN NaN NaN
[26]: df.fillna(df.mean())
[26]: 0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 5.0
2 2.5 2.0 4.0
[2]: data=pd.DataFrame({"a":["one","two"]*3,
"b":[1,1,2,3,2,3]})
data
[2]: a b
0 one 1
1 two 1
2 one 2
3 two 3
4 one 2
5 two 3
[3]: data.duplicated()
[3]: 0 False
1 False
2 False
3 False
4 True
5 True
dtype: bool
[4]: data.drop_duplicates()
[4]: a b
0 one 1
1 two 1
2 one 2
3 two 3
[5]: data["c"]=range(6)
data
[5]: a b c
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 2 4
5 two 3 5
[6]: data.duplicated(["a","b"],keep="last")
[6]: 0 False
1 False
2 True
3 True
4 False
5 False
dtype: bool
[7]: df=pd.DataFrame({"names":["Tim","tom","Sam",
"kate","Kim"],
"scores":[60,50,70,80,40]})
df
[8]: classes={"Tim":"A","Tom":"A","Sam":"B",
"Kate":"B","Kim":"B"}
[9]: n=df["names"].str.capitalize()
[10]: df["branches"]=n.map(classes)
[11]: df
[12]: s=pd.Series([80,70,90,60])
s
[12]: 0 80
1 70
2 90
3 60
dtype: int64
[14]: s.replace(70,np.nan)
[14]: 0 80.0
1 NaN
2 90.0
3 60.0
dtype: float64
[15]: s.replace([70,60],[np.nan,0])
[15]: 0 80.0
1 NaN
2 90.0
3 0.0
dtype: float64
[16]: s.replace({90:100,60:0})
[16]: 0 80
1 70
2 100
3 0
dtype: int64
[17]: df=pd.DataFrame(
np.arange(12).reshape(3,4),
index=[0,1,2],
columns=["tim","tom","kim","sam"])
df
[18]: s=pd.Series(["one","two","three"])
df.index=df.index.map(s)
[19]: df
[19]: tim tom kim sam
one 0 1 2 3
two 4 5 6 7
three 8 9 10 11
[20]: df.rename(index=str.title,columns=str.upper)
[21]: df.rename(index={"one":"ten"},
columns={"sam":"kate"},
inplace=True)
df
[22]: sc=[30,80,40,90,60,45,95,75,55,100,65,85]
[23]: x=[20,40,60,80,100]
[24]: y=pd.cut(sc,x)
y
[24]: [(20, 40], (60, 80], (20, 40], (80, 100], (40, 60], …, (60, 80], (40, 60],
(80, 100], (60, 80], (80, 100]]
Length: 12
Categories (4, interval[int64]): [(20, 40] < (40, 60] < (60, 80] < (80, 100]]
[25]: y.codes
[26]: y.categories
[27]: pd.value_counts(y)
[27]: (80, 100] 4
(40, 60] 3
(60, 80] 3
(20, 40] 2
dtype: int64
[28]: y=pd.cut(sc,x,right=False)
y
[28]: [[20, 40), [80, 100), [40, 60), [80, 100), [60, 80), …, [60.0, 80.0), [40.0,
60.0), NaN, [60.0, 80.0), [80.0, 100.0)]
Length: 12
Categories (4, interval[int64]): [[20, 40) < [40, 60) < [60, 80) < [80, 100)]
[29]: pd.cut(sc,x,labels=["low","medium",
"high","very high"])
[29]: ['low', 'high', 'low', 'very high', 'medium', …, 'high', 'medium', 'very
high', 'high', 'very high']
Length: 12
Categories (4, object): ['low' < 'medium' < 'high' < 'very high']
[30]: pd.cut(sc,10)
[30]: [(29.93, 37.0], (79.0, 86.0], (37.0, 44.0], (86.0, 93.0], (58.0, 65.0], …,
(72.0, 79.0], (51.0, 58.0], (93.0, 100.0], (58.0, 65.0], (79.0, 86.0]]
Length: 12
Categories (10, interval[float64]): [(29.93, 37.0] < (37.0, 44.0] < (44.0, 51.0]
< (51.0, 58.0] … (72.0, 79.0] < (79.0, 86.0] < (86.0, 93.0] < (93.0, 100.0]]
[31]: data=np.random.randn(100)
c=pd.qcut(data,4)
c
[31]: [(-0.135, 0.631], (0.631, 2.286], (-0.629, -0.135], (0.631, 2.286], (-2.186,
-0.629], …, (-0.629, -0.135], (0.631, 2.286], (0.631, 2.286], (-0.135, 0.631],
(-0.629, -0.135]]
Length: 100
Categories (4, interval[float64]): [(-2.186, -0.629] < (-0.629, -0.135] <
(-0.135, 0.631] < (0.631, 2.286]]
[32]: pd.value_counts(c)
[33]: data=pd.DataFrame(np.random.randn(1000,4))
data.head()
[33]: 0 1 2 3
0 0.661162 -1.315550 0.138893 -2.186859
1 -0.422096 0.587658 -0.478577 -0.285737
2 -0.283092 -0.021623 1.194335 -0.197599
3 -1.545286 -0.219977 0.353704 0.424970
4 -0.196521 3.491917 0.016217 -0.464119
[34]: data.describe()
[34]: 0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.048690 0.030168 -0.050958 -0.004881
std 0.991747 0.996656 0.992911 1.021762
min -2.999425 -3.514219 -2.864107 -3.186480
25% -0.723203 -0.622904 -0.726046 -0.752119
50% -0.058605 0.027484 -0.036939 0.023319
75% 0.560961 0.654890 0.605512 0.739495
max 3.952527 3.491917 3.074154 3.002287
[35]: col=data[1]
[36]: col[np.abs(col)>3]
[36]: 4 3.491917
117 3.098920
311 -3.067095
903 -3.514219
Name: 1, dtype: float64
[37]: data[(np.abs(data)>3).any(1)]
[37]: 0 1 2 3
4 -0.196521 3.491917 0.016217 -0.464119
117 -1.360727 3.098920 -0.902404 -1.874759
311 -1.408380 -3.067095 -0.034621 -1.377992
492 -0.508447 -0.264550 -2.246989 -3.096799
510 3.952527 -0.307437 1.160524 1.022866
547 -0.146700 -0.594890 3.074154 -0.198825
642 3.116662 -0.466845 -0.543486 0.038652
867 0.222411 -0.131135 0.618451 3.002287
903 -0.917298 -3.514219 -2.500715 0.259489
956 0.115918 1.314808 -0.220663 -3.186480
[38]: np.sign(data).head()
[38]: 0 1 2 3
0 1.0 -1.0 1.0 -1.0
1 -1.0 1.0 -1.0 -1.0
2 -1.0 -1.0 1.0 -1.0
3 -1.0 -1.0 1.0 1.0
4 -1.0 1.0 1.0 -1.0
[39]: data=pd.DataFrame(
np.arange(12).reshape(4,3))
data
[39]: 0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
[40]: rw=np.random.permutation(4)
rw
[41]: data.take(rw)
[41]: 0 1 2
2 6 7 8
3 9 10 11
0 0 1 2
1 3 4 5
[42]: data.sample()
[42]: 0 1 2
3 9 10 11
[43]: data.sample(n=2)
[43]: 0 1 2
0 0 1 2
2 6 7 8
[44]: data=pd.DataFrame({"letter":["c","b","a","b","b","a"],
"number":range(6)})
data
[44]: letter number
0 c 0
1 b 1
2 a 2
3 b 3
4 b 4
5 a 5
[45]: pd.get_dummies(data["letter"])
[45]: a b c
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 1 0
5 1 0 0
[46]: data=np.random.randn(10)
data
[47]: pd.get_dummies(pd.cut(data,4))
[1]: import pandas as pd
import numpy as np
[2]: data=pd.Series(np.random.randn(8),
index=[["a","a","a","b",
"b","b","c","c"],
[1,2,3,1,2,3,1,2]])
data
[2]: a 1 0.022235
2 0.007393
3 -3.081152
b 1 -0.673017
2 -0.034024
3 0.679701
c 1 1.175051
2 0.916181
dtype: float64
[3]: data.index
[4]: data["a"]
[4]: 1 0.022235
2 0.007393
3 -3.081152
dtype: float64
[5]: data["b":"c"]
[5]: b 1 -0.673017
2 -0.034024
3 0.679701
c 1 1.175051
2 0.916181
dtype: float64
[6]: data.loc[["a","c"]]
[6]: a 1 0.022235
2 0.007393
3 -3.081152
c 1 1.175051
2 0.916181
dtype: float64
[7]: data.loc[:,1]
[7]: a 0.022235
b -0.673017
c 1.175051
dtype: float64
[8]: data.unstack()
[8]: 1 2 3
a 0.022235 0.007393 -3.081152
b -0.673017 -0.034024 0.679701
c 1.175051 0.916181 NaN
[9]: data.unstack().stack()
[9]: a 1 0.022235
2 0.007393
3 -3.081152
b 1 -0.673017
2 -0.034024
3 0.679701
c 1 1.175051
2 0.916181
dtype: float64
[10]: df=pd.DataFrame(
np.arange(12).reshape(4,3),
index=[["a","a","b","b"],
[1,2,1,2]],
columns=[["num","num","ver"],
["math","stat","geo"]])
df
[11]: df.index.names=["class","exam"]
df.columns.names=["field","lesson"]
df
[12]: df["num"]
[13]: df.swaplevel("class","exam")
[14]: df.sort_index(level=1)
[15]: df.sum(level="exam")
[15]: field num ver
lesson math stat geo
exam
1 6 8 10
2 12 14 16
[16]: df.sum(level="field",axis=1)
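Note that the level= argument of sum was deprecated in pandas 1.3 and later removed; a sketch of the equivalent groupby spelling:

df.groupby(level="field", axis=1).sum()
# axis=1 groupby is itself deprecated in pandas 2.x; transposing works everywhere:
# df.T.groupby(level="field").sum().T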
[17]: data=pd.DataFrame({"x":np.arange(8),
"y":np.arange(8,0,-1),
"a":["one"]*4+["two"]*4,
"b":[0,1,2,3]*2})
data
[17]: x y a b
0 0 8 one 0
1 1 7 one 1
2 2 6 one 2
3 3 5 one 3
4 4 4 two 0
5 5 3 two 1
6 6 2 two 2
7 7 1 two 3
[18]: data2=data.set_index(["a","b"])
data2
[18]: x y
a b
one 0 0 8
1 1 7
2 2 6
3 3 5
two 0 4 4
1 5 3
2 6 2
3 7 1
[19]: data3=data.set_index(["a","b"],drop=False)
data3
[19]: x y a b
a b
one 0 0 8 one 0
1 1 7 one 1
2 2 6 one 2
3 3 5 one 3
two 0 4 4 two 0
1 5 3 two 1
2 6 2 two 2
3 7 1 two 3
[20]: data2
[20]: x y
a b
one 0 0 8
1 1 7
2 2 6
3 3 5
two 0 4 4
1 5 3
2 6 2
3 7 1
[21]: data2.reset_index()
[21]: a b x y
0 one 0 0 8
1 one 1 1 7
2 one 2 2 6
3 one 3 3 5
4 two 0 4 4
5 two 1 5 3
6 two 2 6 2
7 two 3 7 1
14.1 Joining DataFrame
[1]: import pandas as pd ; import numpy as np
[2]: d1=pd.DataFrame(
{"key":["a","b","c","c","d","e"],
"num1":range(6)})
d2=pd.DataFrame(
{"key":["b","c","e","f"],
"num2":range(4)})
[3]: print(d1)
print(d2)
key num1
0 a 0
1 b 1
2 c 2
3 c 3
4 d 4
5 e 5
key num2
0 b 0
1 c 1
2 e 2
3 f 3
[6]: d3=pd.DataFrame(
{"key1":["a","b","c","c","d","e"],
"num1":range(6)})
d4=pd.DataFrame(
{"key2":["b","c","e","f"],
"num2":range(4)})
[7]: pd.merge(
d3,d4,left_on="key1",right_on="key2"
)
[8]: pd.merge(d1,d2,how="outer")
[9]: pd.merge(d1,d2,how="left")
[10]: pd.merge(d1,d2,how="right")
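The outputs of these merge cells were lost in extraction. As a sketch of what how="outer" returns with d1 and d2 as above: every key from both frames is kept, and the missing side becomes NaN:

pd.merge(d1, d2, how="outer")
#   key  num1  num2
#    a    0.0   NaN   <- only in d1
#    b    1.0   0.0
#    c    2.0   1.0   <- d1 has two "c" rows; each matches d2's single "c"
#    c    3.0   1.0
#    d    4.0   NaN
#    e    5.0   2.0
#    f    NaN   3.0   <- only in d2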
[12]: df1=pd.DataFrame(
{"key":["a","b","c","c","d","e"],
"num1":range(6),
"count":["one","three","two",
"one","one","two"]})
df2=pd.DataFrame(
{"key":["b","c","e","f"],
"num2":range(4),
"count":["one","two","two","two"]})
14.2 Merging on index
[16]: df1=pd.DataFrame(
{"letter":["a","a","b",
"b","a","c"],
"num":range(6)})
df2=pd.DataFrame(
{"value":[3,5,7]},
index=["a","b","e"])
[17]: print(df1)
print(df2)
letter num
0 a 0
1 a 1
2 b 2
3 b 3
4 a 4
5 c 5
value
a 3
b 5
e 7
[18]: pd.merge(df1,df2,
left_on="letter",
right_index=True)
[19]: right=pd.DataFrame(
[[1,2],[3,4],[5,6]],
index=["a","c","d"],
columns=["Tom","Tim"])
left=pd.DataFrame(
[[7,8],[9,10],[11,12],[13,14]],
index=["a","b","e","f"],
columns=["Sam","Kim"])
[20]: pd.merge(right,left,
right_index=True,
left_index=True,
how="outer")
[21]: left.join(right)
[22]: left.join(right,how="outer")
[23]: data=pd.DataFrame([[1,3],[5,7],[9,11]],
index=["a","b","f"],
columns=["Alex","Keta"])
left.join([right,data])
[24]: arr=np.arange(20).reshape((4,5))
[25]: np.concatenate([arr,arr],axis=1)
[25]: array([[ 0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 10, 11, 12, 13, 14],
[15, 16, 17, 18, 19, 15, 16, 17, 18, 19]])
[26]: data1=pd.Series([0,1],index=["a","b"])
data2=pd.Series([2,3,4],index=["c","d","e"])
data3=pd.Series([5,6],index=["f","g"])
data4=pd.Series([10,11,12],index=["a","b","c"])
[28]: pd.concat([data1,data2,data3])
[28]: a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
[29]: pd.concat([data1,data2,data3],axis=1)
[29]: 0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
pd.concat([data1,data4],axis=1,join="inner")
[30]: 0 1
a 0 10
b 1 11
[31]: pd.concat([data1,data2,data4],
keys=["one","two","three"])
[31]: one a 0
b 1
two c 2
d 3
e 4
three a 10
b 11
c 12
dtype: int64
[32]: df1=pd.DataFrame(np.arange(6).reshape(3,2),
index=["a","b","c"],
columns=["one","two"])
df2=pd.DataFrame(np.arange(10,14).reshape(2,2),
index=["a","c"],
columns=["three","four"])
[34]: pd.concat([df1,df2],axis=1,keys=["s1","s2"])
[34]: s1 s2
one two three four
a 0 1 10.0 11.0
b 2 3 NaN NaN
c 4 5 12.0 13.0
[36]: a b c d
0 0.443128 1.033878 -0.081062 0.720712
1 1.249823 1.695462 -1.911692 -2.135979
2 0.970119 0.152867 0.210750 0.736984
3 -0.930846 -1.478824 NaN 0.084256
4 -0.420467 1.158122 NaN 0.501372
[1]: import pandas as pd ; import numpy as np
[2]: data=pd.DataFrame(
np.arange(16).reshape(4,4),
index=[list("aabb"),[1,2]*2],
columns=[["num","num",
"comp","comp"],
["math","stat"]*2])
data
[3]: data.index.names=["class","exam"]
data.columns.names=["field","lesson"]
data
[3]: field num comp
lesson math stat math stat
class exam
a 1 0 1 2 3
2 4 5 6 7
b 1 8 9 10 11
2 12 13 14 15
[4]: long=data.stack()
[5]: long
[6]: long.unstack()
[7]: data.stack()
[8]: data.stack(0)
[8]: lesson math stat
class exam field
a 1 comp 2 3
num 0 1
2 comp 6 7
num 4 5
b 1 comp 10 11
num 8 9
2 comp 14 15
num 12 13
[9]: data.stack("field")
[10]: s1=pd.Series(
np.arange(4),index=list("abcd"))
s2=pd.Series(
np.arange(6,9),index=list("cde"))
[11]: print(s1)
print(s2)
a 0
b 1
c 2
d 3
dtype: int32
c 6
d 7
e 8
dtype: int32
[12]: data2=pd.concat([s1,s2],keys=["bir","iki"])
data2
[12]: bir a 0
b 1
c 2
d 3
iki c 6
d 7
e 8
dtype: int32
[13]: data2.unstack()
[13]: a b c d e
bir 0.0 1.0 2.0 3.0 NaN
iki NaN NaN 6.0 7.0 8.0
[14]: data2.unstack().stack(dropna=False)
[17]: stock
[19]: stock["value"]=np.random.randn(len(stock))
[20]: stock
[21]: p=stock.pivot("fruit","color")
p
[22]: p["value"]
15.2 Pivoting “Wide” to “Long” Format
[23]: data=pd.DataFrame(
{"lesson":["math","stat","bio"],
"Sam":[50,60,70],
"Kim":[80,70,90],
"Tom":[60,70,85]})
data
[24]: group=pd.melt(data,["lesson"])
[25]: group
[26]: data=group.pivot(
"lesson","variable","value")
data
[27]: data.reset_index()
16 What is Groupby in Pandas?
[1]: import pandas as pd
import numpy as np
[2]: df=pd.DataFrame(
{"key1":list("aabbab"),
"key2":["one","two","three"]*2,
"data1":np.random.randn(6),
"data2":np.random.randn(6)})
df
[3]: group=df["data1"].groupby(df["key1"])
[4]: group
[5]: group.mean()
[5]: key1
a -0.394878
b 0.487807
Name: data1, dtype: float64
[6]: ave=df["data1"].groupby([df["key1"],
df["key2"]]).mean()
ave
[7]: ave.unstack()
[7]: key2 one three two
key1
a 0.128979 NaN -0.656807
b 2.135132 -0.335856 NaN
[8]: df.groupby("key1").mean()
[9]: df.groupby(["key1","key2"]).mean()
[10]: for name,group in df.groupby("key1"):
print(name)
print(group)
a
key1 key2 data1 data2
0 a one 0.128979 0.903436
1 a two -0.334460 -1.431566
4 a two -0.979153 1.918519
b
key1 key2 data1 data2
2 b three -0.506455 -0.854207
3 b one 2.135132 -0.996191
5 b three -0.165257 0.204901
[11]: for (k1,k2),group in df.groupby(["key1","key2"]):
print(k1,k2)
print(group)
a one
key1 key2 data1 data2
0 a one 0.128979 0.903436
a two
key1 key2 data1 data2
1 a two -0.334460 -1.431566
4 a two -0.979153 1.918519
b one
key1 key2 data1 data2
3 b one 2.135132 -0.996191
b three
key1 key2 data1 data2
2 b three -0.506455 -0.854207
5 b three -0.165257 0.204901
[12]: piece=dict(list(df.groupby("key1")))
[13]: piece["a"]
[14]: df.groupby(["key1","key2"])[["data1"]].mean()
[14]: data1
key1 key2
a one 0.128979
two -0.656807
b one 2.135132
three -0.335856
[15]: fruit=pd.DataFrame(np.random.randn(4,4),
columns=["a","b","c","d"],
index=["apple","cherry","banana","kiwi"])
fruit
[15]: a b c d
apple 0.803066 0.165556 0.040465 -0.376024
cherry -0.265198 0.778739 0.574622 -1.292316
banana -0.977442 -0.458472 -1.271370 0.614398
kiwi 0.580412 0.061148 1.257117 -1.351419
[16]: label={"a": "green","b":"yellow",
"c":"green","d":"yellow",
"e":"purple"}
[17]: group=fruit.groupby(label,axis=1)
[18]: group.sum()
[19]: s=pd.Series(label)
s
[19]: a green
b yellow
c green
d yellow
e purple
dtype: object
[20]: fruit.groupby(s,axis=1).count()
[21]: a b c d
4 0.580412 0.061148 1.257117 -1.351419
5 0.803066 0.165556 0.040465 -0.376024
6 -1.242640 0.320267 -0.696747 -0.677917
[23]: data.columns.names=["letter","number"]
data
[23]: letter A B
number 1 2 3 1 2
0 0.741181 -0.399735 0.562333 -1.035530 0.678250
1 0.394531 0.745952 -0.661248 0.811781 -0.804934
2 0.028793 -0.914979 0.857640 -0.780221 1.898880
3 -0.029662 0.092263 1.424289 -0.143006 1.484412
[24]: data.groupby(level="letter",axis=1).sum()
[24]: letter A B
0 0.903779 -0.357280
1 0.479235 0.006847
2 -0.028546 1.118659
3 1.486890 1.341406
[25]: game=pd.read_csv("Data/vgsalesGlobale.csv")
[26]: game.head()
[27]: game.dtypes
EU_Sales float64
JP_Sales float64
Other_Sales float64
Global_Sales float64
dtype: object
[28]: game.dropna().describe()
Other_Sales Global_Sales
count 16291.000000 16291.000000
mean 0.048426 0.540910
std 0.190083 1.567345
min 0.000000 0.010000
25% 0.000000 0.060000
50% 0.010000 0.170000
75% 0.040000 0.480000
max 10.570000 82.740000
[29]: game.Global_Sales.mean()
[29]: 0.5374406555006628
[30]: group=game.groupby("Genre")
[31]: group["Global_Sales"].count()
[31]: Genre
Action 3316
Adventure 1286
Fighting 848
Misc 1739
Platform 886
Puzzle 582
Racing 1249
Role-Playing 1488
Shooter 1310
Simulation 867
Sports 2346
Strategy 681
Name: Global_Sales, dtype: int64
[32]: group["Global_Sales"].describe()
[33]: game[game.Genre=="Action"].Global_Sales.mean()
[33]: 0.5281001206272617
[34]: group.mean()
Other_Sales Global_Sales
Genre
Action 0.056508 0.528100
Adventure 0.013072 0.185879
Fighting 0.043255 0.529375
Misc 0.043312 0.465762
Platform 0.058228 0.938341
Puzzle 0.021564 0.420876
Racing 0.061865 0.586101
Role-Playing 0.040060 0.623233
Shooter 0.078389 0.791885
Simulation 0.036355 0.452364
Sports 0.057532 0.567319
Strategy 0.016681 0.257151
[36]: group["Global_Sales"].mean().plot(kind="bar")
[36]: <AxesSubplot:xlabel='Genre'>
[37]: group[["NA_Sales",
"EU_Sales",
"JP_Sales"]].mean().plot(kind="bar")
[37]: <AxesSubplot:xlabel='Genre'>
17 Working with GroupBy
[1]: import pandas as pd ; import numpy as np
[2]: df=pd.DataFrame({"key":list("ABC")*2,
"data1":range(6),
"data2":np.arange(5,11)})
[3]: df
[4]: group=df.groupby("key")
[5]: group.aggregate(["min",np.median,"max"])
[6]: group.agg({"data1":"min","data2":"max"})
[7]: f=lambda x:x.max()-x.min()
[8]: group.agg(f)
[9]: data=pd.DataFrame({"letter":list("ABC")*4,
"num":["one","two"]*6,
"d1":np.random.randn(12),
"d2":np.arange(10,34,2)})
[10]: data
[10]: letter num d1 d2
… (the first rows were lost in extraction) …
7 B two 0.209867 24
8 C one -1.164496 26
9 A two 0.194654 28
10 B one 0.160770 30
11 C two -0.301062 32
[11]: group=data.groupby(["letter","num"])
[12]: group_d1=group["d1"]
[13]: group_d1.agg("mean")
[14]: group_d1.agg(["mean","std",f])
[15]: group_d1.agg([("f_mean","mean"),
("f_std",np.std)])
[16]: group.agg({"d1":["count","max","mean"],
"d2":"sum"})
[16]: d1 d2
count max mean sum
letter num
A one 2 0.988512 -0.106948 32
two 2 0.194654 -0.417619 44
B one 2 0.232158 0.196464 48
two 2 1.318566 0.764216 36
C one 2 0.606850 -0.278823 40
two 2 0.380758 0.039848 52
[17]: data.groupby(["letter","num"],
as_index=False).mean()
17.2 Split-Apply-Combine
[18]: data
[19]: group=data.groupby("letter")
[20]: group["d2"].apply(lambda x:x.describe())
[20]: letter
A count 4.000000
mean 19.000000
std 7.745967
min 10.000000
25% 14.500000
50% 19.000000
75% 23.500000
max 28.000000
B count 4.000000
mean 21.000000
std 7.745967
min 12.000000
25% 16.500000
50% 21.000000
75% 25.500000
max 30.000000
C count 4.000000
mean 23.000000
std 7.745967
min 14.000000
25% 18.500000
50% 23.000000
75% 27.500000
max 32.000000
Name: d2, dtype: float64
[21]: math=pd.DataFrame({"Class":list("AB")*3,
"Stu":["Kim","Sam",
"Tim","Tom","John","Kate"],
"Score":[60,70,np.nan,
55,np.nan,80]})
math
[22]: group=math.groupby("Class")
[23]: group.mean()
[23]: Score
Class
A 60.000000
B 68.333333
[24]: func=lambda f:f.fillna(f.mean())
[25]: group.apply(func)
[26]: value={"A":100,"B":50}
[27]: func1=lambda f:f.fillna(value[f.name])
[28]: group.apply(func1)
18 Pivot Tables
[1]: import pandas as pd
import numpy as np
[2]: df=pd.DataFrame(
{"class":list("ABC")*4,
"lesson":["math","stat"]*6,
"sex":list("MFMM")*3,
"sibling":[1,2,3]*4,
"score":np.arange(40,100,5)})
[3]: df
[3]: class lesson sex sibling score
0 A math M 1 40
1 B stat F 2 45
2 C math M 3 50
3 A stat M 1 55
4 B math M 2 60
5 C stat F 3 65
6 A math M 1 70
7 B stat M 2 75
8 C math M 3 80
9 A stat F 1 85
10 B math M 2 90
11 C stat M 3 95
[4]: df.groupby("lesson")["score"].mean()
[4]: lesson
math 65
stat 70
Name: score, dtype: int32
[5]: df.groupby(
["lesson",
"class"])[
"score"].aggregate("mean").unstack()
[5]: class A B C
lesson
math 55 75 65
stat 70 60 80
[6]: df.pivot_table(
"score",
index="lesson",
columns="class")
[6]: class A B C
lesson
math 55 75 65
stat 70 60 80
[7]: df.pivot_table(
["sibling","score"],
index=["class","lesson"],
columns="sex")
[7]: score sibling
sex F M F M
class lesson
A math NaN 55.0 NaN 1.0
stat 85.0 55.0 1.0 1.0
B math NaN 75.0 NaN 2.0
stat 45.0 75.0 2.0 2.0
C math NaN 65.0 NaN 3.0
stat 65.0 95.0 3.0 3.0
[8]: df.pivot_table(
["sibling","score"],
index=["class","lesson"],
columns="sex",margins=True)
[9]: df.pivot_table(
["sibling","score"],
index=["class","lesson"],
columns="sex",fill_value=0)
[11]: df.pivot_table("score",
["lesson",sib],
"class",fill_value=0)
[11]: class A B C
lesson sibling
math (0, 2] 55 75 0
(2, 3] 0 0 65
stat (0, 2] 70 60 0
(2, 3] 0 0 80
[12]: df.pivot_table(
"score",
index="lesson",
columns="class")
[12]: class A B C
lesson
math 55 75 65
stat 70 60 80
[13]: df.pivot_table(
index="lesson",
columns="class",
aggfunc="sum")
[14]: df.pivot_table(
index="lesson",
columns="class",
aggfunc={"sibling":"max",
"score":"sum"})
[16]: pd.crosstab([df.sibling, df.lesson], df.sex)
[16]: sex F M
sibling lesson
1 math 0 2
stat 1 1
2 math 0 2
stat 1 1
3 math 0 2
stat 1 1
[17]: births=pd.read_csv("Data/births.txt")
[18]: births.head()
[19]: births["ten_year"]=10*(births["year"]//10)
births.pivot_table("births",
index="ten_year",
columns="gender",
aggfunc="sum")
[19]: gender F M
ten_year
1960 1753634 1846572
1970 16263075 17121550
1980 18310351 19243452
1990 19479454 20420553
2000 18229309 19106428
[23]: births.pivot_table("births",
index="year",
columns="gender",
aggfunc="sum").plot()
plt.ylabel("Annual total births")
[2]: data=pd.Series(["Tim","Tom","Sam","Sam"]*3)
data
[2]: 0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
dtype: object
[3]: pd.unique(data)
[4]: pd.value_counts(data)
[4]: Sam 6
Tom 3
Tim 3
dtype: int64
[5]: values=pd.Series([0,1,0,0]*3)
[6]: names=pd.Series(["Tim","Sam"])
names.take(values)
[6]: 0 Tim
1 Sam
0 Tim
0 Tim
0 Tim
1 Sam
0 Tim
0 Tim
0 Tim
1 Sam
0 Tim
0 Tim
dtype: object
[7]: 0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
dtype: object
[8]: N=len(data)
[9]: df=pd.DataFrame(
{"name":data,
"num":np.arange(N),
"score":np.random.randint(40,100,
size=N),
"weight":np.random.uniform(50,70,
size=N)},
columns=["num","name","score","weight"])
[10]: df
[11]: df["name"]
[11]: 0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: object
[12]: type(df["name"])
[12]: pandas.core.series.Series
[13]: name_cat=df["name"].astype("category")
name_cat
[13]: 0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']
[14]: x=name_cat.values
[15]: x.categories
[16]: x.codes
[17]: df["name"]=df["name"].astype("category")
df.name
[17]: 0 Tim
1 Tom
2 Sam
3 Sam
4 Tim
5 Tom
6 Sam
7 Sam
8 Tim
9 Tom
10 Sam
11 Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']
[18]: data_cat=pd.Categorical(list("abcde"))
data_cat
[21]: people_cat=pd.Categorical.from_codes(
codes,people,ordered=True)
people_cat
[22]: people_cat.as_ordered()
19.3 Working with Categorical
[23]: data=np.random.randn(1000)
[24]: interval=pd.qcut(data,4)
interval
[25]: type(interval)
[25]: pandas.core.arrays.categorical.Categorical
[26]: interval=pd.qcut(data,4,labels=["Q1","Q2",
"Q3","Q4"])
interval
[26]: ['Q1', 'Q4', 'Q4', 'Q3', 'Q1', …, 'Q2', 'Q4', 'Q2', 'Q4', 'Q1']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
[27]: interval=pd.Series(interval,name="quarter")
[28]: pd.Series(
data).groupby(
interval).agg(["count",
"min",
"max"]).reset_index()
[30]: label=pd.Series(["a","b","c","d"]*(N//4))
[31]: cat=label.astype("category")
[32]: label.memory_usage()
[32]: 80000128
[33]: cat.memory_usage()
[33]: 10000320
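The roughly 8x saving follows from the storage model: the object Series keeps an 8-byte pointer per element, while the categorical keeps one small integer code per element plus a single copy of each category. A sketch, assuming label and cat from above:

cat.values.codes.dtype   # int8: four categories fit in one byte per row
label.memory_usage()     # ~ N * 8 bytes of pointers
cat.memory_usage()       # ~ N * 1 byte of codes (plus a tiny category table)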
[35]: s_ct=s.astype("category")
s_ct
[35]: 0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
[36]: s_ct.cat.codes
[36]: 0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 3
dtype: int8
[37]: s_ct.cat.categories
[38]: new_ct=["a","b","c","d","e"]
s_ct.cat.set_categories(new_ct)
[38]: 0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
[39]: s2_ct=s_ct[s_ct.isin(["a","b"])]
s2_ct
[39]: 0 a
1 b
4 a
5 b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
[40]: s2_ct.cat.remove_unused_categories()
[40]: 0 a
1 b
4 a
5 b
dtype: category
Categories (2, object): ['a', 'b']
[41]: s_ct
[41]: 0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
[42]: pd.get_dummies(s_ct)
[42]: a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
6 0 0 1 0
7 0 0 0 1
[1]: "hello".upper()
[1]: 'HELLO'
[2]: import pandas as pd ; import numpy as np
[3]: data=["tim","Kate","SUSan",np.nan,"aLEX"]
[4]: name=pd.Series(data)
[5]: name.str.capitalize()
[5]: 0 Tim
1 Kate
2 Susan
3 NaN
4 Alex
dtype: object
[6]: name.str.lower()
[6]: 0 tim
1 kate
2 susan
3 NaN
4 alex
dtype: object
[7]: name.str.len()
[7]: 0 3.0
1 4.0
2 5.0
3 NaN
4 4.0
dtype: float64
[8]: name.str.startswith("a")
[8]: 0 False
1 False
2 False
3 NaN
4 True
dtype: object
[9]: df=pd.DataFrame(
np.random.randn(3,2),
columns=["Column A","Column B"],
index=range(3))
df
[10]: df.columns
[12]: s=pd.Series(["a_b_c","c_d_e",np.nan,"f_g_h"])
s
[12]: 0 a_b_c
1 c_d_e
2 NaN
3 f_g_h
dtype: object
[13]: s.str.split("_").str[1]
[13]: 0 b
1 d
2 NaN
3 g
dtype: object
[14]: s.str.split("_",expand=True,n=1)
[14]: 0 1
0 a b_c
1 c d_e
2 NaN NaN
3 f g_h
[15]: money=pd.Series(["15","-$20","$30000"])
money
[15]: 0 15
1 -$20
2 $30000
dtype: object
[16]: money.str.replace("-\$","")
[16]: 0 15
1 20
2 $30000
dtype: object
[17]: money.str.replace("-\$","-")
[17]: 0 15
1 -20
2 $30000
dtype: object
You can refer to the pandas documentation at pandas.pydata.org for the full list of string methods.
[18]: film=pd.read_csv("http://bit.ly/imdbratings")
[19]: film.head()
4 8.9 Pulp Fiction R Crime 154
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…
[20]: film.title.str.upper()
[21]: film.columns=film.columns.str.capitalize()
[22]: film.head()
Actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…
[23]: film[film.Actors_list.str.contains(
"Brad Pitt")]
24 8.7 Se7en
106 8.3 Snatch.
114 8.3 Inglourious Basterds
264 8.1 Twelve Monkeys
508 7.8 The Curious Case of Benjamin Button
577 7.8 Ocean's Eleven
683 7.7 Fury
776 7.6 Moneyball
779 7.6 Interview with the Vampire: The Vampire Chroni…
807 7.6 The Assassination of Jesse James by the Coward…
826 7.5 Sleepers
877 7.5 Legends of the Fall
901 7.5 Babel
Actors_list
9 [u'Brad Pitt', u'Edward Norton', u'Helena Bonh…
24 [u'Morgan Freeman', u'Brad Pitt', u'Kevin Spac…
106 [u'Jason Statham', u'Brad Pitt', u'Benicio Del…
114 [u'Brad Pitt', u'Diane Kruger', u'Eli Roth']
264 [u'Bruce Willis', u'Madeleine Stowe', u'Brad P…
508 [u'Brad Pitt', u'Cate Blanchett', u'Tilda Swin…
577 [u'George Clooney', u'Brad Pitt', u'Julia Robe…
683 [u'Brad Pitt', u'Shia LaBeouf', u'Logan Lerman']
776 [u'Brad Pitt', u'Robin Wright', u'Jonah Hill']
779 [u'Brad Pitt', u'Tom Cruise', u'Antonio Bander…
807 [u'Brad Pitt', u'Casey Affleck', u'Sam Shepard']
826 [u'Robert De Niro', u'Kevin Bacon', u'Brad Pitt']
877 [u'Brad Pitt', u'Anthony Hopkins', u'Aidan Qui…
901 [u'Brad Pitt', u'Cate Blanchett', u'Gael Garc\…
[24]: film.Actors_list.str.replace("[","")
[24]: 0 u'Tim Robbins', u'Morgan Freeman', u'Bob Gunton']
1 u'Marlon Brando', u'Al Pacino', u'James Caan']
2 u'Al Pacino', u'Robert De Niro', u'Robert Duva…
3 u'Christian Bale', u'Heath Ledger', u'Aaron Ec…
4 u'John Travolta', u'Uma Thurman', u'Samuel L. …
…
974 u'Dustin Hoffman', u'Jessica Lange', u'Teri Ga…
975 u'Michael J. Fox', u'Christopher Lloyd', u'Mar…
976 u'Russell Crowe', u'Paul Bettany', u'Billy Boyd']
977 u'JoBeth Williams', u"Heather O'Rourke", u'Cra…
978 u'Charlie Sheen', u'Michael Douglas', u'Tamara…
Name: Actors_list, Length: 979, dtype: object
[25]: film.Actors_list.str.replace(
"[","").str.replace("]","")
If you want, you can download the data set from https://openpolicing.stanford.edu/data/. You
can also get information about the data set from this site. Let's import the data set.
[1]: import pandas as pd
[2]: df=pd.read_csv(
"Data/ca_san_diego_2019_02_25.csv")
[3]: df.head()
3 4 2014-01-01 08:10:00 610 23.0
4 5 2014-01-01 08:35:00 930 35.0
[4]: df.tail(3)
reason_for_stop
390996 Radio Call/Citizen Contact
390997 Moving Violation
390998 Equipment Violation
[5]: df.shape
[6]: df.dtypes
[7]: df.isnull().sum()
[7]: raw_row_number 0
date 132
time 1256
service_area 0
subject_age 12644
subject_race 1398
subject_sex 806
type 0
arrest_made 35022
citation_issued 32712
warning_issued 32712
outcome 40047
contraband_found 379835
search_conducted 37096
search_person 376459
search_vehicle 376459
search_basis 374173
reason_for_search 376343
reason_for_stop 266
dtype: int64
[8]: df.date.head()
[8]: 0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2014-01-01
Name: date, dtype: object
[9]: df.columns
[10]: df["time"].head()
[10]: 0 01:25:00
1 05:47:00
2 07:46:00
3 08:10:00
4 08:35:00
Name: time, dtype: object
[11]: df[["date","time"]].head()
[12]: df.rename(columns={"date":"DATE",
"time":"TIME"},
inplace=True)
df.head()
[12]: raw_row_number DATE TIME service_area subject_age \
0 1 2014-01-01 01:25:00 110 24.0
1 2 2014-01-01 05:47:00 320 42.0
2 3 2014-01-01 07:46:00 320 29.0
3 4 2014-01-01 08:10:00 610 23.0
4 5 2014-01-01 08:35:00 930 35.0
[13]: df.iloc[0].head()
[13]: raw_row_number 1
DATE 2014-01-01
TIME 01:25:00
service_area 110
subject_age 24
Name: 0, dtype: object
[14]: df.iloc[0,1]
[14]: '2014-01-01'
[15]: df.iloc[0,[1,3,5]]
[16]: df.iloc[0:5,[1,3,5]]
[17]: df.iloc[0:5,0:5]
[18]: df.loc[1:5,"TIME":"type"]
type
1 vehicular
2 vehicular
3 vehicular
4 vehicular
5 vehicular
[19]: df.loc[0:5,"DATE":"type"].head()
subject_sex type
0 male vehicular
1 male vehicular
2 male vehicular
3 male vehicular
4 male vehicular
[20]: df.shape
[21]: df.dropna(axis="columns",how="all").shape
[22]: df.dropna(axis="columns",how="any").shape
[22]: (390999, 3)
[24]: df[df.reason_for_stop==
"Moving Violation"
].subject_sex.value_counts(normalize=True)
[25]: df[df.reason_for_stop==
"Moving Violation"
].subject_sex.value_counts(
normalize=True)
[26]: df[df.subject_sex==
"female"
].reason_for_stop.value_counts(
normalize=True).head()
[26]: Moving Violation 0.773651
Equipment Violation 0.215736
Radio Call/Citizen Contact 0.003606
Muni, County, H&S Code 0.002516
Personal Knowledge/Informant 0.001609
Name: reason_for_stop, dtype: float64
[27]: df.groupby(
"subject_sex"
).reason_for_stop.value_counts(
normalize=True).head()
[28]: df.groupby(
"subject_sex"
).reason_for_stop.value_counts(
normalize=True).unstack()
female 0.003606 0.000044
male 0.005690 0.000043
reason_for_stop Suspect Info (I.S., Bulletin, Log) UNI, &County, H&&S Code \
subject_sex
female 0.000929 0.000110
male 0.001619 0.000229
reason_for_stop none listed not listed not marked not marked not marked \
subject_sex
female 0.000007 NaN 0.000015 0.000007
male 0.000016 0.000004 0.000008 NaN
[2 rows x 26 columns]
[29]: df.arrest_made.value_counts()
[30]: df.arrest_made.value_counts(normalize=True)
[31]: df.groupby(
"subject_sex"
).arrest_made.value_counts(normalize=True)
[32]: df.groupby(
["subject_race","subject_sex"]
).arrest_made.value_counts(
normalize=True).head()
[32]: subject_race subject_sex arrest_made
asian/pacific islander female False 0.993134
True 0.006866
male False 0.987704
True 0.012296
black female False 0.985657
Name: arrest_made, dtype: float64
[33]: df.DATE.str.slice(0,4).value_counts()
[35]: df["stop_datetime"]=pd.to_datetime(combined)
df.head()
4 NaN NaN NaN Equipment Violation
stop_datetime
0 2014-01-01 01:25:00
1 2014-01-01 05:47:00
2 2014-01-01 07:46:00
3 2014-01-01 08:10:00
4 2014-01-01 08:35:00
[36]: df.dtypes
[37]: df.stop_datetime.dt.month.head()
[37]: 0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
Name: stop_datetime, dtype: float64
[38]: df.arrest_made.head()
[38]: 0 False
1 False
2 False
3 False
4 False
Name: arrest_made, dtype: object
[39]: df["arrest_made"]=df.arrest_made.astype(bool)
[40]: df.arrest_made.value_counts()
[41]: df.arrest_made.mean()
[41]: 0.10214604129422326
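One caution about the conversion above: astype(bool) turns NaN into True, so the roughly 35,000 missing arrest_made values are counted as arrests and inflate this mean. A safer sketch (an assumption, not from the original notebook):

df.arrest_made.fillna(False).astype(bool).mean()  # treats missing values as "no arrest"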
[42]: df.groupby(
df.stop_datetime.dt.hour
).arrest_made.mean().head()
[42]: stop_datetime
0.0 0.113073
1.0 0.140684
2.0 0.141698
3.0 0.127541
4.0 0.105165
Name: arrest_made, dtype: float64
[44]: df.groupby(
df.stop_datetime.dt.hour
).arrest_made.mean().plot()
[44]: <AxesSubplot:xlabel='stop_datetime'>
[45]: df.stop_datetime.dt.hour.value_counts().head()
[46]: df.stop_datetime.dt.hour.value_counts().sort_index().head()
[47]: df.stop_datetime.dt.hour.value_counts().sort_index().plot()
[47]: <AxesSubplot:>
22 Multiple Selecting-Filtering
[1]: import pandas as pd
[2]: film=pd.read_csv("http://bit.ly/imdbratings")
[3]: film.head(3)
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
[4]: film.title.head()
[4]: 0 The Shawshank Redemption
1 The Godfather
2 The Godfather: Part II
3 The Dark Knight
4 Pulp Fiction
Name: title, dtype: object
[5]: film[["title","genre"]].head()
[7]: film.loc[[0,2,4],]
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…
[8]: film.loc[0:2,]
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
[9]: film.loc[0:5,"title"]
[10]: film.loc[0:5,"title":"genre"]
[11]: film.loc[0:5,"title":"duration"]
[12]: film.loc[:,"title":"genre"].head()
[13]: film.loc[film.genre=="Crime",].head(3)
[13]: star_rating title content_rating genre duration \
0 9.3 The Shawshank Redemption R Crime 142
1 9.2 The Godfather R Crime 175
2 9.1 The Godfather: Part II R Crime 200
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
[14]: film.loc[film.genre=="Crime",[
"title","duration"]]
[15]: film.loc[
film.genre=="Crime","title":"duration"]
22.3 iloc method
[16]: film.iloc[:,0].head()
[16]: 0 9.3
1 9.2
2 9.1
3 9.0
4 8.9
Name: star_rating, dtype: float64
[17]: film.columns
[18]: film.iloc[:,[0,3]].head()
[19]: film.iloc[:,0:3].head()
[20]: film.iloc[0,0:3]
[21]: film.iloc[0:5,0:3]
3 9.0 The Dark Knight PG-13
4 8.9 Pulp Fiction R
[22]: film.iloc[0:5,].head(3)
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
[23]: film.iloc[0:5,:].head(3)
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
22.4 Multiple Filtering
[24]: film.loc[film.duration>=200,].head(3)
[25]: film.loc[film.duration>=200,"title"].head()
85 Lawrence of Arabia
Name: title, dtype: object
[26]: film[(
film.duration>=200)|(
film.genre=="Crime")|(
film.genre=="Action")].head(3)
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
[27]: film[film.genre.isin([
"Crime","Drama","Action"])]
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…
.. …
970 [u'Michael Douglas', u'Tobey Maguire', u'Franc…
972 [u'Ryan Gosling', u'Michelle Williams', u'John…
973 [u'Tobey Maguire', u'Charlize Theron', u'Micha…
976 [u'Russell Crowe', u'Paul Bettany', u'Billy Bo…
978 [u'Charlie Sheen', u'Michael Douglas', u'Tamar…
23 Time Series
[1]: from datetime import datetime
import pandas as pd
import numpy as np
[2]: date=[datetime(2020,1,5),
datetime(2020,1,10),
datetime(2020,1,15),
datetime(2020,1,20),
datetime(2020,1,25)]
[3]: ts=pd.Series(np.random.randn(5),index=date)
ts
[4]: ts.index
23.2 Time Series Data Structures
[5]: pd.to_datetime("01/01/2020")
[6]: dates=pd.to_datetime(
[datetime(2020,7,5),
"6th of July, 2020",
"2020-Jul-7",
"20200708"])
dates
[7]: dates.to_period("D")
[8]: dates-dates[0]
[11]: pd.date_range("2020-07-15",
periods=10,
freq="H")
[11]: DatetimeIndex(['2020-07-15 00:00:00', '2020-07-15 01:00:00',
'2020-07-15 02:00:00', '2020-07-15 03:00:00',
'2020-07-15 04:00:00', '2020-07-15 05:00:00',
'2020-07-15 06:00:00', '2020-07-15 07:00:00',
'2020-07-15 08:00:00', '2020-07-15 09:00:00'],
dtype='datetime64[ns]', freq='H')
[12]: pd.period_range("2020-10",
periods=10,
freq="M")
[13]: pd.timedelta_range(0,periods=8,freq="H")
[13]: TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
'0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
'0 days 06:00:00', '0 days 07:00:00'],
dtype='timedelta64[ns]', freq='H')
[14]: stamp=ts.index[1]
stamp
[15]: ts[stamp]
[15]: 0.1320158576081964
[16]: ts["25.1.2020"]
[16]: 1.0438287426871447
[17]: ts["20200125"]
[17]: 1.0438287426871447
[18]: long_ts=pd.Series(
np.random.randn(1000),
index=pd.date_range("1/1/2020",
periods=1000))
long_ts.head()
2020-01-03 -1.171112
2020-01-04 0.140425
2020-01-05 1.861661
Freq: D, dtype: float64
[19]: long_ts["2020"].head()
[20]: long_ts["2020-10"].head(15)
[21]: long_ts[datetime(2022,9,20):]
23.4 The Important Methods Used in Time Series
[22]: ts
[23]: ts.truncate(after="1/15/2020")
[24]: date=pd.date_range("1/1/2020",
periods=100,
freq="W-SUN")
[25]: long_df=pd.DataFrame(np.random.randn(100,4),
index=date,
columns=list("ABCD"))
long_df.head()
[25]: A B C D
2020-01-05 0.298570 0.541989 0.270855 0.812892
2020-01-12 0.559454 0.052274 -0.129981 -1.355922
2020-01-19 -0.936940 -1.359541 0.060583 0.055678
2020-01-26 1.413215 -1.267353 0.260763 1.158904
2020-02-02 -0.480138 1.104503 0.925559 -0.183879
[26]: long_df["2020-10"]
[26]: A B C D
2020-10-04 0.273074 0.634981 -0.887838 0.148878
2020-10-11 0.108231 -0.053346 -0.520075 -1.125796
2020-10-18 -0.757139 -0.287251 -1.123849 1.163562
2020-10-25 0.550777 0.295018 0.622508 -0.551081
[27]: date=pd.DatetimeIndex(
["1/1/2020","1/2/2020","1/2/2020",
"1/2/2020","1/3/2020"])
ts1=pd.Series(np.arange(5),index=date)
ts1
[27]: 2020-01-01 0
2020-01-02 1
2020-01-02 2
2020-01-02 3
2020-01-03 4
dtype: int32
[28]: ts1.index.is_unique
[28]: False
[29]: group=ts1.groupby(level=0)
[30]: group.count()
[30]: 2020-01-01 1
2020-01-02 3
2020-01-03 1
dtype: int64
[31]: group.mean()
[31]: 2020-01-01 0
2020-01-02 2
2020-01-03 4
dtype: int32
[1]: import pandas as pd
import numpy as np
[2]: date=pd.date_range(
start="2018",end="2019", freq="BM")
[3]: ts=pd.Series(
np.random.randn(len(date)),index=date)
ts
2018-08-31 0.181651
2018-09-28 -2.518869
2018-10-31 1.428868
2018-11-30 -0.357551
2018-12-31 0.612771
Freq: BM, dtype: float64
[4]: ts.index
[5]: ts[:5].index
[6]: fb=pd.read_csv("FB.csv")
[7]: fb.head()
Volume
0 65280800
1 40356500
2 34042100
3 32400000
4 24763400
High float64
Low float64
Close float64
Adj Close float64
Volume int64
dtype: object
[9]: fb=pd.read_csv(
"FB.csv", parse_dates=["Date"])
[10]: fb=pd.read_csv(
"FB.csv",
parse_dates=["Date"],
index_col="Date")
[11]: fb.index
[12]: fb.head()
Volume
Date
2018-07-30 65280800
2018-07-31 40356500
2018-08-01 34042100
2018-08-02 32400000
2018-08-03 24763400
[13]: fb["2019-06"]
[13]: Open High Low Close Adj Close \
Date
2019-06-03 175.000000 175.050003 161.009995 164.149994 164.149994
2019-06-04 163.710007 168.279999 160.839996 167.500000 167.500000
2019-06-05 167.479996 168.720001 164.630005 168.169998 168.169998
2019-06-06 168.300003 169.699997 167.229996 168.330002 168.330002
2019-06-07 170.169998 173.869995 168.839996 173.350006 173.350006
2019-06-10 174.750000 177.860001 173.800003 174.820007 174.820007
2019-06-11 178.479996 179.979996 176.789993 178.100006 178.100006
2019-06-12 178.380005 179.270004 172.880005 175.039993 175.039993
2019-06-13 175.529999 178.029999 174.610001 177.470001 177.470001
2019-06-14 180.509995 181.839996 180.000000 181.330002 181.330002
Volume
Date
2019-06-03 56059600
2019-06-04 46044300
2019-06-05 19758300
2019-06-06 12446400
2019-06-07 16917300
2019-06-10 14767900
2019-06-11 15266600
2019-06-12 17699800
2019-06-13 12253600
2019-06-14 16773700
[14]: 181.27450025000002
[15]: fb["2019-07-05":"2019-07-10"]
Volume
Date
2019-07-05 11164100
2019-07-08 9723900
2019-07-09 14698600
2019-07-10 20571700
[16]: t=pd.to_datetime("7/22/2019")
t
[17]: fb.loc[fb.index>=t,:]
Volume
Date
2019-07-22 13589000
2019-07-23 14583700
2019-07-24 32532500
2019-07-25 39889900
2019-07-26 24426700
2019-07-29 754198
[19]: fb1.head()
[20]: dates=pd.date_range(start="03/01/2019",
end="03/29/2019",
freq="B")
dates
'2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29'],
dtype='datetime64[ns]', freq='B')
[21]: fb1.set_index(dates,inplace=True)
[22]: fb1.head()
[23]: fb1.index
[25]: fb1.Close.plot()
[25]: <AxesSubplot:>
[26]: fb1.asfreq("H",method="pad").head()
Volume
2019-03-01 00:00:00 11097800
2019-03-01 01:00:00 11097800
2019-03-01 02:00:00 11097800
2019-03-01 03:00:00 11097800
2019-03-01 04:00:00 11097800
2019-03-24 165.649994 167.419998 164.089996 164.339996 164.339996 16389200
Volume
2019-03-01 00:00:00 11097800
2019-03-01 01:00:00 11097800
2019-03-01 02:00:00 11097800
2019-03-01 03:00:00 11097800
2019-03-01 04:00:00 11097800
… …
2019-03-28 20:00:00 10443000
2019-03-28 21:00:00 10443000
2019-03-28 22:00:00 10443000
2019-03-28 23:00:00 10443000
2019-03-29 00:00:00 13455500
[29]: z=pd.date_range(start="3/1/2019",
periods=60 , freq="B")
z
'2019-05-02', '2019-05-03', '2019-05-06', '2019-05-07',
'2019-05-08', '2019-05-09', '2019-05-10', '2019-05-13',
'2019-05-14', '2019-05-15', '2019-05-16', '2019-05-17',
'2019-05-20', '2019-05-21', '2019-05-22', '2019-05-23'],
dtype='datetime64[ns]', freq='B')
[30]: z=pd.date_range(
start="3/1/2019", periods=30, freq="H")
z
[31]: ts=pd.Series(
np.random.randint(1,10,len(z)),index=z)
ts.head()
25.1 to_datetime method
[2]: pd.to_datetime("15/08/2019")
C:\Users\lenovo\AppData\Local\Temp\ipykernel_15352\2752892349.py:1: UserWarning:
Parsing '15/08/2019' in DD/MM/YYYY format. Provide format or specify
infer_datetime_format=True for consistent parsing.
pd.to_datetime("15/08/2019")
[4]: pd.to_datetime(date)
[5]: pd.to_datetime("03/05/2019")
[7]: pd.to_datetime("05*03*2019",
format="%d*%m*%Y" )
[8]: pd.to_datetime("05$03$2019",
format="%d$%m$%Y")
[9]: date=["2019-01-05",
"jan 6, 2019",
"7/05/2019",
"2019/01/9",
"20190110"]
[10]: pd.to_datetime(date)
[10]: DatetimeIndex(['2019-01-05', '2019-01-06', '2019-07-05', '2019-01-09',
'2019-01-10'],
dtype='datetime64[ns]', freq=None)
[12]: t=1000000000
[15]: pd.date_range(
"2010-01-01","2010-09-03",
freq="WOM-4SUN")
[16]: p=pd.Period("2019")
[17]: dir(p)
[17]: ['__add__',
'__array_priority__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pyx_vtable__',
'__radd__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rsub__',
'__setattr__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__weakref__',
'_add_offset',
'_add_timedeltalike_scalar',
'_dtype',
'_from_ordinal',
'_get_to_timestamp_base',
'_maybe_convert_freq',
'_require_matching_freq',
'asfreq',
'day',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'end_time',
'freq',
'freqstr',
'hour',
'is_leap_year',
'minute',
'month',
'now',
'ordinal',
'quarter',
'qyear',
'second',
'start_time',
'strftime',
'to_timestamp',
'week',
'weekday',
'weekofyear',
'year']
[18]: p.start_time
[19]: p.end_time
[21]: a+5
[22]: a-3
[23]: p-pd.Period("2015")
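Period arithmetic works in units of the period's own frequency: adding an integer moves that many steps, and subtracting two periods of the same frequency gives the distance between them. A small sketch:

p1 = pd.Period("2019", freq="A-DEC")
p1 + 2                                  # Period('2021', 'A-DEC')
p1 - pd.Period("2015", freq="A-DEC")    # 4 annual steps (an offset object in recent pandas, an integer in older versions)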
[24]: rng=pd.period_range(
"2019-01-01","2019-08-30",freq="M")
rng
[25]: pd.Series(range(8),index=rng)
[25]: 2019-01 0
2019-02 1
2019-03 2
2019-04 3
2019-05 4
2019-06 5
2019-07 6
2019-08 7
Freq: M, dtype: int64
[26]: p=pd.Period("2019",freq="A-DEC")
p
[28]: p.asfreq("M",how="end")
[31]: p.end_time
[33]: rng=pd.period_range(
"2019Q3","2020Q4", freq="Q-JAN")
rng
[34]: ts=pd.Series(range(len(rng)),index=rng)
ts
[34]: 2019Q3 0
2019Q4 1
2020Q1 2
2020Q2 3
2020Q3 4
2020Q4 5
Freq: Q-JAN, dtype: int64
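"Q-JAN" defines quarters of a fiscal year that ends in January, so 2019Q3 here covers August through October 2018. The period's start_time and end_time make this easy to check:

q = pd.Period("2019Q3", freq="Q-JAN")
q.start_time   # Timestamp('2018-08-01 00:00:00')
q.end_time     # Timestamp('2018-10-31 23:59:59.999999999')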
[35]: rng=pd.date_range(
"2020-01-01", periods=5, freq="M")
[36]: ts=pd.Series(range(len(rng)),index=rng)
ts
[36]: 2020-01-31 0
2020-02-29 1
2020-03-31 2
2020-04-30 3
2020-05-31 4
Freq: M, dtype: int64
[37]: pts=ts.to_period()
pts.index
[38]: pts.index
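The conversion is reversible: to_timestamp maps each period back to a timestamp, by default the start of the period. A quick sketch:

pts.to_timestamp()            # DatetimeIndex at the start of each month
pts.to_timestamp(how="end")   # or at the end of each month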
26 Important Methods for Time Series in Pandas
[1]: import pandas as pd ; import numpy as np
[3]: fb.head()
Volume
Date
2018-07-30 65280800
2018-07-31 40356500
2018-08-01 34042100
2018-08-02 32400000
2018-08-03 24763400
[4]: fb.resample("M").mean()
Volume
Date
2018-07-31 5.281865e+07
2018-08-31 2.386229e+07
2018-09-30 2.634046e+07
2018-10-31 2.706288e+07
2018-11-30 2.467389e+07
2018-12-31 2.940980e+07
2019-01-31 2.512133e+07
2019-02-28 1.590754e+07
2019-03-31 1.847315e+07
2019-04-30 1.818978e+07
2019-05-31 1.303734e+07
2019-06-30 2.132143e+07
2019-07-31 1.543907e+07
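resample is not limited to a single statistic; agg applies several functions per bin at once. A small sketch over the same fb frame:

# monthly mean, minimum, and maximum of the closing price in one pass
fb.Close.resample("M").agg(["mean", "min", "max"])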
[6]: fb.Close.resample("M").mean().plot()
[6]: <AxesSubplot:xlabel='Date'>
[7]: fb.Close.resample("Q").mean().plot(
kind="bar")
[7]: <AxesSubplot:xlabel='Date'>
26.2 Shifting
[8]: fb1=pd.DataFrame(fb.Close["2019-03"])
[9]: fb1.head()
[9]: Close
Date
2019-03-01 162.279999
2019-03-04 167.369995
2019-03-05 171.259995
2019-03-06 172.509995
2019-03-07 169.130005
[10]: fb1.shift(2)
[10]: Close
Date
2019-03-01 NaN
2019-03-04 NaN
2019-03-05 162.279999
2019-03-06 167.369995
2019-03-07 171.259995
2019-03-08 172.509995
2019-03-11 169.130005
2019-03-12 169.600006
2019-03-13 172.070007
2019-03-14 171.919998
2019-03-15 173.369995
2019-03-18 170.169998
2019-03-19 165.979996
2019-03-20 160.470001
2019-03-21 161.570007
2019-03-22 165.440002
2019-03-25 166.080002
2019-03-26 164.339996
2019-03-27 166.289993
2019-03-28 167.679993
2019-03-29 165.869995
[11]: fb1.shift(-2)
[11]: Close
Date
2019-03-01 171.259995
2019-03-04 172.509995
2019-03-05 169.130005
2019-03-06 169.600006
2019-03-07 172.070007
2019-03-08 171.919998
2019-03-11 173.369995
2019-03-12 170.169998
2019-03-13 165.979996
2019-03-14 160.470001
2019-03-15 161.570007
2019-03-18 165.440002
2019-03-19 166.080002
2019-03-20 164.339996
2019-03-21 166.289993
2019-03-22 167.679993
2019-03-25 165.869995
2019-03-26 165.550003
2019-03-27 166.690002
2019-03-28 NaN
2019-03-29 NaN
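Pairing a column with its shifted copy is the standard way to compute day-over-day changes; pct_change wraps the shift-and-divide in one call. A minimal sketch:

daily_return = fb1.Close / fb1.Close.shift(1) - 1   # (today - yesterday) / yesterday
daily_return = fb1.Close.pct_change()               # equivalent shortcut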
[13]: fb1
[15]: fb1.head()
"Previous Price"]
[17]: fb2=fb1[["Close"]]
[18]: fb2.head()
[18]: Close
Date
2019-03-01 162.279999
2019-03-04 167.369995
2019-03-05 171.259995
2019-03-06 172.509995
2019-03-07 169.130005
[19]: fb2.index
[20]: fb2.index=pd.date_range(
"2019-03-01", periods=21, freq="B")
[21]: fb2.index
[22]: fb2.tshift(1)
[22]: Close
2019-03-04 162.279999
2019-03-05 167.369995
2019-03-06 171.259995
2019-03-07 172.509995
2019-03-08 169.130005
2019-03-11 169.600006
2019-03-12 172.070007
2019-03-13 171.919998
2019-03-14 173.369995
2019-03-15 170.169998
2019-03-18 165.979996
2019-03-19 160.470001
2019-03-20 161.570007
2019-03-21 165.440002
2019-03-22 166.080002
2019-03-25 164.339996
2019-03-26 166.289993
2019-03-27 167.679993
2019-03-28 165.869995
2019-03-29 165.550003
2019-04-01 166.690002
[23]: fb2.tshift(-2)
[23]: Close
2019-02-27 162.279999
2019-02-28 167.369995
2019-03-01 171.259995
2019-03-04 172.509995
2019-03-05 169.130005
2019-03-06 169.600006
2019-03-07 172.070007
2019-03-08 171.919998
2019-03-11 173.369995
2019-03-12 170.169998
2019-03-13 165.979996
2019-03-14 160.470001
2019-03-15 161.570007
2019-03-18 165.440002
2019-03-19 166.080002
2019-03-20 164.339996
2019-03-21 166.289993
2019-03-22 167.679993
2019-03-25 165.869995
2019-03-26 165.550003
2019-03-27 166.690002
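Note that tshift was deprecated in pandas 1.1 and removed in 2.0; shift with a freq argument moves the index the same way while leaving the values in place:

fb2.shift(1, freq="B")    # modern replacement for fb2.tshift(1)
fb2.shift(-2, freq="B")   # modern replacement for fb2.tshift(-2)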
26.3 Moving Window Functions
[24]: fb.Close.plot()
[24]: <AxesSubplot:xlabel='Date'>
[25]: fb.Close.plot()
fb.Close.rolling(30).mean().plot()
[25]: <AxesSubplot:xlabel='Date'>
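rolling exposes the other window statistics as well, and ewm gives exponentially weighted versions that react faster to recent values. A short sketch:

fb.Close.rolling(30).std()                    # 30-day rolling standard deviation
fb.Close.rolling(30, min_periods=1).mean()    # emit a value from the very first row
fb.Close.ewm(span=30).mean()                  # exponentially weighted moving average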
26.4 Time Zone Handling
[26]: import pytz
[27]: pytz.timezone("Turkey")
[28]: pytz.timezone("America/New_York")
[29]: pytz.common_timezones[-7:]
[29]: ['US/Arizona',
'US/Central',
'US/Eastern',
'US/Hawaii',
'US/Mountain',
'US/Pacific',
'UTC']
[31]: ts=pd.Series(np.random.randn(len(x)),
index=x)
[32]: ts
[33]: print(ts.index.tz)
None
[34]: ts_utc=ts.tz_localize("UTC")
ts_utc
[35]: ts_utc.tz_convert("US/Hawaii")
[37]: zstamp_utc=zstamp.tz_localize("utc")
zstamp_utc
[38]: zstamp_utc.tz_convert("Europe/Istanbul")
[39]: ts
[40]: ts1=ts[:5].tz_localize("Europe/Berlin")
ts2=ts[2:].tz_localize("Europe/Istanbul")
[41]: result=ts1+ts2
[42]: result.index
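When series localized to different zones are combined, pandas aligns them on UTC, so the result's index is tz-aware in UTC:

result.index.tz    # UTC: mixed-zone arithmetic always aligns on UTC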
from file: 27-Data Visualization with Pandas - Part 1
[3]: plt.style.use("fivethirtyeight")
[5]: data.plot()
[5]: <AxesSubplot:>
[8]: df1.plot()
[8]: <AxesSubplot:>
27.2 Bar Charts
[9]: df1.iloc[10].plot(kind='bar')
[9]: <AxesSubplot:>
[10]: df1.iloc[10].plot.bar()
[10]: <AxesSubplot:>
[11]: df2=pd.DataFrame(np.random.rand(7,3), columns=list("ABC"))
[12]: df2.plot.bar()
[12]: <AxesSubplot:>
[13]: df2.plot.bar(stacked=True)
[13]: <AxesSubplot:>
[14]: df2.plot.barh(stacked=True)
[14]: <AxesSubplot:>
27.3 Histograms
[15]: iris=pd.read_csv("iris.data", header=None)
iris.columns=["sepal_length","sepal_width", "petal_length",
"petal_width", "species"]
[17]: iris.plot.hist(alpha=0.7)
[17]: <AxesSubplot:ylabel='Frequency'>
[18]: <AxesSubplot:ylabel='Frequency'>
[19]: bins=25
[20]: <AxesSubplot:ylabel='Frequency'>
[21]: iris["sepal_width"].plot.hist(orientation="horizontal")
[21]: <AxesSubplot:xlabel='Frequency'>
[22]: iris["sepal_length"].diff().hist()
[22]: <AxesSubplot:>
[23]: iris.hist(color="blue", alpha=1, bins=20)
[23]: array([[<AxesSubplot:title={'center':'sepal_length'}>,
<AxesSubplot:title={'center':'sepal_width'}>],
[<AxesSubplot:title={'center':'petal_length'}>,
<AxesSubplot:title={'center':'petal_width'}>]], dtype=object)
[24]: iris.hist("petal_length",by="species")
[24]: array([[<AxesSubplot:title={'center':'Iris-setosa'}>,
<AxesSubplot:title={'center':'Iris-versicolor'}>],
[<AxesSubplot:title={'center':'Iris-virginica'}>, <AxesSubplot:>]],
dtype=object)
27.4 Boxplot charts
[25]: iris.plot.box()
[25]: <AxesSubplot:>
[26]: colors={'boxes': 'Red', 'whiskers': 'blue','medians': 'Black', 'caps': 'Green'}
[27]: iris.plot.box(color=colors)
[27]: <AxesSubplot:>
[28]: iris.plot.box(vert=False)
[28]: <AxesSubplot:>
[29]: iris.boxplot()
[29]: <AxesSubplot:>
[30]: plt.rcParams["figure.figsize"]=(8,8)
plt.style.use("ggplot")
iris.boxplot(by='species')
from file: 28-Data Visualization with Pandas - Part 2
[2]: plt.style.use("fivethirtyeight")
28.1 Area Charts
[3]: df = pd.DataFrame(np.random.rand(10, 4), columns=list("ABCD"))
df.head()
[3]: A B C D
0 0.304421 0.862368 0.924565 0.216823
1 0.278892 0.909381 0.955020 0.877742
2 0.537353 0.113712 0.967656 0.526703
3 0.588321 0.331512 0.036084 0.299474
4 0.253378 0.175976 0.610472 0.326040
[4]: df["A"].plot.area()
[4]: <AxesSubplot:>
[5]: df.plot.area()
[5]: <AxesSubplot:>
[6]: df.plot.area(stacked=False)
[6]: <AxesSubplot:>
[7]: iris=pd.read_csv("iris.data", header=None)
[9]: iris.dtypes
[10]: iris.plot.area()
[10]: <AxesSubplot:>
[11]: iris.plot.area(stacked=False)
[11]: <AxesSubplot:>
28.2 Scatter Plots
[12]: df.plot.scatter(x='A', y='B')
[13]: movies=pd.read_csv("imdbratings.txt")
[14]: movies.head()
actors_list
0 [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt…
1 [u'Marlon Brando', u'Al Pacino', u'James Caan']
2 [u'Al Pacino', u'Robert De Niro', u'Robert Duv…
3 [u'Christian Bale', u'Heath Ledger', u'Aaron E…
4 [u'John Travolta', u'Uma Thurman', u'Samuel L…
[15]: movies.dtypes
duration int64
actors_list object
dtype: object
[18]: iris.plot.scatter(x='sepal_length', y='sepal_width',
c='petal_length', s=100)
[19]: iris.plot.scatter(x='sepal_length', y='sepal_width',
s=iris['petal_length'] * 50)
28.3 Hexagonal bin charts
[20]: movies.plot.hexbin(x="star_rating", y="duration", gridsize=25)
[21]: movies.plot.hexbin(x="star_rating", y="duration", gridsize=10)
28.4 Pie Charts
[22]: iris_avg=iris["petal_width"].groupby(iris["species"]).mean()
iris_avg
[22]: species
Iris-setosa 0.244
Iris-versicolor 1.326
Iris-virginica 2.026
Name: petal_width, dtype: float64
[23]: iris_avg.plot.pie()
[23]: <AxesSubplot:ylabel='petal_width'>
[24]: iris_avg_2=iris[["petal_width",
"petal_length"]].groupby(iris["species"]).mean()
[25]: iris_avg_2.plot.pie(subplots=True)
[25]: array([<AxesSubplot:ylabel='petal_width'>,
<AxesSubplot:ylabel='petal_length'>], dtype=object)
[26]: iris_avg.plot.pie()
[26]: <AxesSubplot:ylabel='petal_width'>
[27]: <AxesSubplot:ylabel='petal_width'>
[28]: iris_avg.plot.pie(labels=["setosa","versicolor", "virginica"],
colors=list("brg"))
[28]: <AxesSubplot:ylabel='petal_width'>
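A fuller pie call can also add percentage labels and a larger font; autopct and fontsize below are illustrative additions, not recovered from the original:

# autopct and fontsize are illustrative, hypothetical choices
iris_avg.plot.pie(labels=["setosa", "versicolor", "virginica"],
colors=list("brg"), autopct="%.2f", fontsize=14)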
28.5 Density chart
[29]: iris.plot.kde()
[29]: <AxesSubplot:ylabel='Density'>
28.6 Scatter matrix
[30]: from pandas.plotting import scatter_matrix
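scatter_matrix draws a scatter plot for every pair of numeric columns in a frame. A typical call, as a sketch (the figsize, alpha, and diagonal values are illustrative choices):

# pairwise scatter plots of the iris measurements, KDE curves on the diagonal
scatter_matrix(iris, figsize=(10, 10), alpha=0.8, diagonal="kde")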