Pandas Learndatasci
Tutorial: A Complete Introduction for Beginners
Learn some of the most important pandas features for
exploring, cleaning, transforming, visualizing, and learning
from data.
This tool is essentially your data’s home. Through pandas, you get
acquainted with your data by cleaning, transforming, and analyzing it.
For example, say you want to explore a dataset stored in a CSV on your
computer. Pandas will extract the data from that CSV into a DataFrame
(a table, basically), which you can then explore, clean, transform, and
visualize.
import pandas as pd
You'll see how these components work when we start working with data
below.
There are many ways to create a DataFrame from scratch, but a great
option is to just use a simple dict .
Let's say we have a fruit stand that sells apples and oranges. We want
to have a column for each fruit and a row for each customer purchase.
To organize this as a dictionary for pandas we could do something like:
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
And then pass it to the pandas DataFrame constructor:
purchases = pd.DataFrame(data)
purchases
OUT:
   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2
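The next table shows the same purchases labeled by customer instead of the default 0-3 index. One way to get there is to pass an index list to the DataFrame constructor. This is a minimal sketch that assumes the four rows belong to June, Robert, Lily, and David, the names that appear in the output that follows:

```python
import pandas as pd

data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2],
}

# Label each row with the customer who made the purchase.
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

# Rows can now be looked up by customer name.
june = purchases.loc['June']
```

With named rows, .loc takes the customer name directly, which is what the lookup further down relies on.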
purchases
OUT:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2
purchases.loc['June']
OUT:
apples 3
oranges 0
There's more on locating and extracting data from the DataFrame later,
but now you should be able to create a DataFrame with any random
data to learn on.
df = pd.read_csv('purchases.csv')
df
OUT:
0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2
df = pd.read_csv('purchases.csv', index_col=0)
df
OUT:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2
You'll find that most CSVs won't ever have an index column and so
usually you don't have to worry about this step.
df = pd.read_json('purchases.json')
df
OUT:
        apples  oranges
David        1        2
June         3        0
Lily         0        7
Robert       2        3
Notice this time our index came with us correctly, since JSON allows
indexes to be represented through nesting. Feel free to open
purchases.json in a text editor so you can see how it is structured.
Pandas will try to figure out how to create a DataFrame by analyzing the
structure of your JSON, and sometimes it doesn't get it right. Often you'll
need to set the orient keyword argument depending on that structure, so
check the read_json docs for that argument to see which orientation
your file uses.
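To make the effect of orient concrete, here is a small sketch that parses the same JSON text two ways. The JSON string is made up for illustration; only read_json and its orient argument come from pandas:

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON where the outer keys are row labels (customers).
raw = '{"June": {"apples": 3, "oranges": 0}, "Robert": {"apples": 2, "oranges": 3}}'

# orient='index': outer keys become the row index.
by_rows = pd.read_json(StringIO(raw), orient='index')

# orient='columns' (the default for DataFrames): outer keys become the columns.
by_cols = pd.read_json(StringIO(raw), orient='columns')
```

If a loaded frame looks transposed, the orient value probably does not match how the file nests its keys.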
import sqlite3
con = sqlite3.connect("database.db")
SQL Tip
If you have data in PostgreSQL, MySQL, or some other SQL
server, you'll need to obtain the right Python library to make
a connection. For example, psycopg2 is a commonly used
library for making connections to PostgreSQL. Furthermore, you
would make a connection to a database URI instead of a file
like we did here with SQLite.
For a great course on SQL check out The Complete SQL
Bootcamp on Udemy
In this SQLite database we have a table called purchases, and our index
is in a column called "index".
By passing a SELECT query and our con , we can read from
the purchases table:
df = pd.read_sql_query("SELECT * FROM purchases", con)
df
OUT:
0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2
Just like with CSVs, we could pass index_col='index' , but we can also set
an index after-the-fact:
df = df.set_index('index')
df
OUT:
        apples  oranges
index
June         3        0
Robert       2        3
Lily         0        7
David        1        2
df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
df.to_sql('new_purchases', con)
When we save JSON and CSV files, all we have to input into those
functions is our desired filename with the appropriate file extension.
With SQL, we’re not creating a new file but instead inserting a new table
into the database using our con variable from before.
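A quick way to check that a table written with to_sql() can be read back is a round trip through a throwaway database. The sketch below assumes an in-memory SQLite database (":memory:") rather than the database.db file used above, so it leaves nothing on disk:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')  # throwaway in-memory database for the example

df = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=pd.Index(['June', 'Robert', 'Lily', 'David'], name='index'),
)

# Write the frame as a new table; the index is stored as a column named "index".
df.to_sql('new_purchases', con)

# Read it back, restoring that column as the index.
restored = pd.read_sql_query('SELECT * FROM new_purchases', con, index_col='index')
con.close()
```

Passing index_col when reading saves the separate set_index() step shown earlier.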
Let's move on to importing some real-world data and detailing a few of
the operations you'll be using a lot.
We're loading this dataset from a CSV and designating the movie titles
to be our index.
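The load itself is a read_csv() call with index_col set to the title column. The sketch below uses a two-row in-memory stand-in for the file so it runs anywhere; with the real dataset you would pass the path to your CSV (for example a hypothetical "movies.csv") instead of the StringIO object:

```python
from io import StringIO

import pandas as pd

# Two-row stand-in for the movies CSV (made-up file contents, real column names).
csv_text = StringIO(
    'Rank,Title,Year,Rating\n'
    '1,Guardians of the Galaxy,2014,8.1\n'
    '2,Prometheus,2012,7.0\n'
)

# index_col='Title' makes the movie titles the row labels.
movies_sample = pd.read_csv(csv_text, index_col='Title')
```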
movies_df.head()
OUT:
                         Rank                     Genre                                        Description              Director                                             Actors  Year  Runtime (Minutes)  Rating   Votes  Revenue (Millions)  Metascore
Title
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi  A group of intergalactic criminals are forced ...            James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121     8.1  757074              333.13       76.0
Prometheus                  2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...          Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012                124     7.0  485820              126.46       65.0
Split                       3           Horror,Thriller  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016                117     7.3  157606              138.12       62.0
Sing                        4   Animation,Comedy,Family  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016                108     7.2   60545              270.32       59.0
Suicide Squad               5  Action,Adventure,Fantasy  A secret government agency recruits some of th...            David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...  2016                123     6.2  393727              325.02       40.0
.head() outputs the first five rows of your DataFrame by default, but we
could also pass a number as well: movies_df.head(10) would output the
top ten rows, for example.
To see the last five rows use .tail(). It also accepts a number,
and in this case we are printing the bottom two rows:
movies_df.tail(2)
OUT:
              Rank                  Genre                                        Description          Director                                             Actors  Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
Title
Search Party   999       Adventure,Comedy  A pair of friends embark on a mission to reuni...    Scot Armstrong  Adam Pally, T.J. Miller, Thomas Middleditch,Sh...  2014                 93     5.6   4881                 NaN       22.0
Nine Lives    1000  Comedy,Family,Fantasy  A stuffy businessman finds himself trapped ins...  Barry Sonnenfeld  Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...  2016                 87     5.3  12435               19.64       11.0
You'll notice that the index in our DataFrame is the Title column, which
you can tell by how the word Title is slightly lower than the rest of the
columns.
movies_df.info()
OUT:
<class 'pandas.core.frame.DataFrame'>
.info() provides the essential details about your dataset, such as the
number of rows and columns, the number of non-null values, what type
of data is in each column, and how much memory your DataFrame is
using.
Notice in our movies dataset we have some obvious missing values in
the Revenue and Metascore columns. We'll look at how to handle those in
a bit.
Seeing the datatype quickly is actually quite useful. Imagine you just
imported some JSON and the integers were recorded as strings. You go
to do some arithmetic and hit an "unsupported operand" exception
because you can't do math with strings. Calling .info() will quickly point
out that the column you thought was all integers actually holds string
objects.
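Once .info() has revealed a numeric column stored as strings, the usual next step is to convert it. The sketch below uses a made-up single-column frame and pd.to_numeric(); it is one common way to do the conversion, not the only one:

```python
import pandas as pd

# Hypothetical import where the vote counts arrived as strings.
df = pd.DataFrame({'votes': ['757074', '485820', '157606']})

# False: the column holds text, so arithmetic would not behave as expected.
was_numeric = pd.api.types.is_numeric_dtype(df['votes'])

# Convert the column; by default to_numeric raises if a value cannot be parsed.
df['votes'] = pd.to_numeric(df['votes'])
total_votes = df['votes'].sum()
```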
Another fast and useful attribute is .shape , which outputs just a tuple of
(rows, columns):
movies_df.shape
OUT:
(1000, 11)
Handling duplicates
This dataset does not have duplicate rows, but it is always important to
verify you aren't aggregating duplicate rows.
temp_df = pd.concat([movies_df, movies_df])
temp_df.shape
OUT:
(2000, 11)
Using pd.concat() returns a new DataFrame without affecting the original.
(Older code used movies_df.append(movies_df) here, but DataFrame.append()
was removed in pandas 2.0.) We are capturing this copy in temp_df so we
aren't working with the real data.
Notice that calling .shape quickly proves our DataFrame rows have doubled.
Now we can try dropping duplicates:
temp_df = temp_df.drop_duplicates()
temp_df.shape
OUT:
(1000, 11)
The drop_duplicates() method also returns a copy of your DataFrame,
this time with duplicates removed.
Calling .shape confirms we're back to the 1000 rows of our original
dataset.
It's a little verbose to keep assigning DataFrames to the same variable
like in this example. For this reason, pandas has the inplace keyword
argument on many of its methods. Using inplace=True will modify the
DataFrame object in place:
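temp_df.drop_duplicates(inplace=True)
Another important argument for drop_duplicates() is keep, which accepts
'first' (the default, keep the first occurrence), 'last' (keep the last
occurrence), or False (drop every row that has a duplicate). To see
keep=False in action, rebuild the doubled DataFrame first:
temp_df = pd.concat([movies_df, movies_df])
temp_df.drop_duplicates(inplace=True, keep=False)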
temp_df.shape
OUT:
(0, 11)
Since all rows were duplicates, keep=False dropped them all resulting in
zero rows being left over. If you're wondering why you would want to do
this, one reason is that it allows you to locate all duplicates in your
dataset. When conditional selections are shown below you'll see how to
do that.
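If the goal is to look at the duplicated rows rather than drop them, the related duplicated() method returns a Boolean mask that can be used for selection. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Split', 'Sing', 'Split', 'Prometheus'],
    'year': [2016, 2016, 2016, 2012],
})

# keep=False flags every member of a duplicate group, not just the repeats.
dupes = df[df.duplicated(keep=False)]

# drop_duplicates(keep=False) is the complement: rows that occur exactly once.
uniques = df.drop_duplicates(keep=False)
```

Here dupes holds both "Split" rows, while uniques keeps only the rows that were never repeated.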
Column cleanup
Many times datasets will have verbose column names with symbols,
upper and lowercase words, spaces, and typos. To make selecting data
by column name easier we can spend a little time cleaning up their
names.
movies_df.columns
OUT:
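Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')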
Not only does .columns come in handy if you want to rename columns by
allowing for simple copy and paste, it's also useful if you need to
understand why you are receiving a Key Error when selecting data by
column.
We can use the .rename() method to rename certain or all columns via
a dict . We don't want parentheses, so let's rename those:
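movies_df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
OUT:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')
What if we also want every name in lowercase? Instead of using .rename()
we can assign a list of names to .columns directly:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year',
                     'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns
OUT:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year',
       'runtime', 'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')
But that's too much work. Instead of renaming each column
manually we can use a list comprehension:
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns
OUT:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year',
       'runtime', 'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')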
list (and dict ) comprehensions come in handy a lot when working with
pandas and data in general.
It's a good idea to lowercase, remove special characters, and replace
spaces with underscores if you'll be working with a dataset for some
time.
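The three cleanup steps named above (lowercase, remove special characters, replace spaces with underscores) can be bundled into one helper. clean_column is a hypothetical name introduced here for illustration; it is not part of pandas:

```python
import re


def clean_column(name):
    """Lowercase, drop special characters, and use underscores for spaces."""
    name = name.strip().lower()
    # Remove everything except lowercase letters, digits, whitespace, and underscores.
    name = re.sub(r'[^0-9a-z\s_]', '', name)
    # Collapse each run of whitespace into a single underscore.
    return re.sub(r'\s+', '_', name)


original = ['Rank', 'Runtime (Minutes)', 'Revenue (Millions)']
cleaned = [clean_column(col) for col in original]
```

On a real DataFrame this would be applied as movies_df.columns = [clean_column(col) for col in movies_df.columns]. Note that it produces runtime_minutes rather than the shorter runtime chosen by hand above.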
How to work with missing values
When exploring data, you’ll most likely encounter missing or null values,
which are essentially placeholders for non-existent values. Most
commonly you'll see Python's None or NumPy's np.nan , each of which
are handled differently in some situations.
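There are two options in dealing with nulls: get rid of the rows or
columns that contain them, or replace them with non-null values (a
technique known as imputation). Either way, the first step is to check
which cells in the DataFrame are null: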
movies_df.isnull()
OUT:
                          rank  genre  description  director  actors   year  runtime  rating  votes  revenue_millions  metascore
Title
Guardians of the Galaxy  False  False        False     False   False  False    False   False  False             False      False
Prometheus               False  False        False     False   False  False    False   False  False             False      False
Split                    False  False        False     False   False  False    False   False  False             False      False
Sing                     False  False        False     False   False  False    False   False  False             False      False
Suicide Squad            False  False        False     False   False  False    False   False  False             False      False
Notice isnull() returns a DataFrame where each cell is either True or
False depending on that cell's null status.
To count the number of nulls in each column we use an aggregate
function for summing:
movies_df.isnull().sum()
OUT:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 128
metascore 64
dtype: int64
movies_df.dropna()
This operation will delete any row with at least a single null value, but it
will return a new DataFrame without altering the original one. You could
specify inplace=True in this method as well.
So in the case of our dataset, this operation would remove 128 rows
where revenue_millions is null and 64 rows where metascore is null. This
obviously seems like a waste since there's perfectly good data in the
other columns of those dropped rows. That's why we'll look at imputation
next.
Other than just dropping rows, you can also drop columns with null
values by setting axis=1 :
movies_df.dropna(axis=1)
Intuition
What's with this axis=1 parameter?
It's not immediately obvious where axis comes from and why you
need it to be 1 for it to affect columns. To see why, just
look at the .shape output:
movies_df.shape
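The tuple returned by .shape lists rows first and columns second, so rows sit at position 0 and columns at position 1, and axis follows the same numbering. A minimal sketch on a made-up two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [4.0, np.nan, 6.0],
})

n_rows, n_cols = df.shape          # shape[0] counts rows, shape[1] counts columns

dropped_rows = df.dropna(axis=0)   # axis 0: drop rows containing a null
dropped_cols = df.dropna(axis=1)   # axis 1: drop columns containing a null
```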
Imputation
Imputation is a conventional feature engineering technique used to keep
valuable data that have null values.
There may be instances where dropping every row with a null value
removes too big a chunk from your dataset, so instead we can impute
that null with another value, usually the mean or the median of that
column.
Let's look at imputing the missing values in the revenue_millions column.
First we'll extract that column into its own variable:
revenue = movies_df['revenue_millions']
revenue.head()
OUT:
Title
Guardians of the Galaxy    333.13
Prometheus                 126.46
Split                      138.12
Sing                       270.32
Suicide Squad              325.02
revenue_mean = revenue.mean()
revenue_mean
OUT:
82.95637614678897
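movies_df['revenue_millions'] = revenue.fillna(revenue_mean)
We have now replaced all nulls in the revenue_millions column with the
mean of the column. Notice that we assigned the result back to movies_df
rather than calling revenue.fillna(revenue_mean, inplace=True) on the
extracted Series: with Copy-on-Write (the default behavior from pandas
3.0), an in-place change to revenue no longer propagates to the original
movies_df. The null count confirms the column is now complete: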
movies_df.isnull().sum()
OUT:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 0
metascore 64
dtype: int64
Imputing an entire column with the same value like this is a basic
example. It would be a better idea to try a more granular imputation by
Genre or Director.
For example, you would find the mean of the revenue generated in each
genre individually and impute the nulls in each genre with that genre's
mean.
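The per-genre idea can be sketched with groupby() and transform(). The four-row frame below is made up for illustration, and the approach shown (fill each null with the mean of its own group) is one reasonable way to do it rather than a prescribed method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Comedy', 'Comedy'],
    'revenue_millions': [10.0, np.nan, 4.0, np.nan],
})

# transform('mean') returns one value per row: the mean of that row's genre.
genre_means = df.groupby('genre')['revenue_millions'].transform('mean')

# Fill each null with the mean of its own genre, leaving known values untouched.
df['revenue_millions'] = df['revenue_millions'].fillna(genre_means)
```

Because transform() keeps the original row order and index, the result lines up with the column it is filling.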
Let's now look at more ways to examine and understand the dataset.
movies_df.describe()
OUT:
movies_df['genre'].describe()
OUT:
count 1000
unique 207
top Action,Adventure,Sci-Fi
freq 50
This tells us that the genre column has 207 unique values, the top value
is Action/Adventure/Sci-Fi, which shows up 50 times (freq).
movies_df['genre'].value_counts().head(10)
OUT:
Action,Adventure,Sci-Fi 50
Drama 48
Comedy,Drama,Romance 35
Comedy 32
Drama,Romance 31
Action,Adventure,Fantasy 27
Comedy,Drama 27
Animation,Adventure,Comedy 27
Comedy,Romance 26
Crime,Drama,Thriller 24
movies_df.corr(numeric_only=True)
OUT:
                       rank       year    runtime     rating      votes  revenue_millions  metascore
rank               1.000000  -0.261605  -0.221739  -0.219555  -0.283876         -0.252996  -0.191869
year              -0.261605   1.000000  -0.164900  -0.211219  -0.411904         -0.117562  -0.079305
runtime           -0.221739  -0.164900   1.000000   0.392214   0.407062          0.247834   0.211978
rating            -0.219555  -0.211219   0.392214   1.000000   0.511537          0.189527   0.631897
votes             -0.283876  -0.411904   0.407062   0.511537   1.000000          0.607941   0.325684
revenue_millions  -0.252996  -0.117562   0.247834   0.189527   0.607941          1.000000   0.133328
metascore         -0.191869  -0.079305   0.211978   0.631897   0.325684          0.133328   1.000000
So looking in the first row, first column we see rank has a perfect
correlation with itself, which is obvious. On the other hand, the
correlation between votes and revenue_millions is 0.6. A little more
interesting.
Examining bivariate relationships comes in handy when you have an
outcome or dependent variable in mind and would like to see the
features most correlated to the increase or decrease of the outcome.
You can visually represent bivariate relationships with scatterplots (seen
below in the plotting section).
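When there is one outcome column in mind, it is often easier to pull just that column out of the correlation table and sort it. The sketch below uses a made-up frame with a hypothetical revenue outcome; numeric_only=True tells corr() to skip text columns:

```python
import pandas as pd

df = pd.DataFrame({
    'votes': [10, 20, 30, 40],
    'runtime': [90, 120, 100, 110],
    'revenue': [1.0, 2.0, 3.0, 4.0],
    'genre': ['a', 'b', 'a', 'b'],
})

corr_with_revenue = (
    df.corr(numeric_only=True)['revenue']   # correlations of every numeric column with revenue
      .drop('revenue')                      # a column's correlation with itself is always 1
      .sort_values(ascending=False)         # strongest positive relationship first
)
```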
By column
You already saw how to extract a column using square brackets like
this:
genre_col = movies_df['genre']
type(genre_col)
OUT:
pandas.core.series.Series
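This returns a Series. To extract a column as a DataFrame, pass a list of
column names. In our case that's just a single column:
genre_col = movies_df[['genre']]
type(genre_col)
OUT:
pandas.core.frame.DataFrame
Since it's just a list, adding another column name is easy:
subset = movies_df[['genre', 'rating']]
subset.head()
OUT:
                                            genre  rating
Title
Guardians of the Galaxy   Action,Adventure,Sci-Fi     8.1
Prometheus               Adventure,Mystery,Sci-Fi     7.0
Split                             Horror,Thriller     7.3
Sing                      Animation,Comedy,Family     7.2
Suicide Squad            Action,Adventure,Fantasy     6.2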
By rows
For rows, we have two options: .loc locates rows by index label, and
.iloc locates them by integer position.
prom = movies_df.loc["Prometheus"]
prom
OUT:
rank 2
genre Adventure,Mystery,Sci-Fi
year 2012
runtime 124
rating 7
votes 485820
revenue_millions 126.46
metascore 65
prom = movies_df.iloc[1]
movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
movie_subset
OUT:
            rank                     genre                                        description              director                                             actors  year  runtime  rating   votes  revenue_millions  metascore
Title
Prometheus     2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...          Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012      124     7.0  485820            126.46       65.0
Split          3           Horror,Thriller  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016      117     7.3  157606            138.12       62.0
Sing           4   Animation,Comedy,Family  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016      108     7.2   60545            270.32       59.0
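The two slices above return the same three rows even though their end points look different: a label slice with .loc includes its end label, while a position slice with .iloc stops before its end position, like a Python list slice. A minimal sketch on a made-up five-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {'rank': [1, 2, 3, 4, 5]},
    index=['Guardians of the Galaxy', 'Prometheus', 'Split', 'Sing', 'Suicide Squad'],
)

by_label = df.loc['Prometheus':'Sing']   # label slice: 'Sing' is included
by_position = df.iloc[1:4]               # position slice: stops before position 4
```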
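Conditional selections
We've seen how to select columns and rows, but what if we want to select
by a condition, for example only the films directed by Ridley Scott? We
start by building a Boolean condition:
condition = (movies_df['director'] == "Ridley Scott")
condition.head()
OUT:
Title
Guardians of the Galaxy    False
Prometheus                  True
Split                      False
Sing                       False
Suicide Squad              False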
Similar to isnull() , this returns a Series of True and False values: True
for films directed by Ridley Scott and False for ones not directed by him.
We want to filter out all movies not directed by Ridley Scott, in other
words, we don’t want the False films. To return the rows where that
condition is True we have to pass this operation into the DataFrame:
movies_df[movies_df['director'] == "Ridley Scott"]
OUT:
                        rank                     genre                                        description      director                                             actors  year  runtime  rating   votes  revenue_millions  metascore  rating_category
Title
Prometheus                 2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012      124     7.0  485820            126.46       65.0              bad
The Martian              103    Adventure,Drama,Sci-Fi  An astronaut becomes stranded on Mars after hi...  Ridley Scott  Matt Damon, Jessica Chastain, Kristen Wiig, Ka...  2015      144     8.0  556097            228.43       80.0             good
Robin Hood               388    Action,Adventure,Drama  In 12th century England, Robin and his band of...  Ridley Scott  Russell Crowe, Cate Blanchett, Matthew Macfady...  2010      140     6.7  221117            105.22       53.0              bad
American Gangster        471     Biography,Crime,Drama  In 1970s America, a detective works to bring d...  Ridley Scott  Denzel Washington, Russell Crowe, Chiwetel Eji...  2007      157     7.8  337835            130.13       76.0              bad
Exodus: Gods and Kings   517    Action,Adventure,Drama  The defiant leader Moses rises up against the ...  Ridley Scott  Christian Bale, Joel Edgerton, Ben Kingsley, S...  2014      150     6.0  137299             65.01       52.0              bad
Conditional selections also work with numerical values. For example, we
can filter the DataFrame by rating:
movies_df[movies_df['rating'] >= 8.6].head(3)
OUT:
                 rank                    genre                                        description           director                                             actors  year  runtime  rating    votes  revenue_millions  metascore
Title
Interstellar       37   Adventure,Drama,Sci-Fi  A team of explorers travel through a wormhole ...  Christopher Nolan  Matthew McConaughey, Anne Hathaway, Jessica Ch...  2014      169     8.6  1047747            187.99       74.0
The Dark Knight    55       Action,Crime,Drama  When the menace known as the Joker wreaks havo...  Christopher Nolan  Christian Bale, Heath Ledger, Aaron Eckhart,Mi...  2008      152     9.0  1791916            533.32       82.0
Inception          81  Action,Adventure,Sci-Fi  A thief, who steals corporate secrets through ...  Christopher Nolan  Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...  2010      148     8.8  1583625            292.57       74.0
We can build richer conditionals with the logical operators | for "or"
and & for "and". Let's filter the DataFrame to show only movies by
Christopher Nolan OR Ridley Scott:
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()
OUT:
                 rank                     genre                                        description           director                                             actors  year  runtime  rating    votes  revenue_millions  metascore
Title
Prometheus          2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...       Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012      124     7.0   485820            126.46       65.0
Interstellar       37    Adventure,Drama,Sci-Fi  A team of explorers travel through a wormhole ...  Christopher Nolan  Matthew McConaughey, Anne Hathaway, Jessica Ch...  2014      169     8.6  1047747            187.99       74.0
The Dark Knight    55        Action,Crime,Drama  When the menace known as the Joker wreaks havo...  Christopher Nolan  Christian Bale, Heath Ledger, Aaron Eckhart,Mi...  2008      152     9.0  1791916            533.32       82.0
The Prestige       65      Drama,Mystery,Sci-Fi  Two stage magicians engage in competitive one-...  Christopher Nolan  Christian Bale, Hugh Jackman, Scarlett Johanss...  2006      130     8.5   913152             53.08       66.0
Inception          81   Action,Adventure,Sci-Fi  A thief, who steals corporate secrets through ...  Christopher Nolan  Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...  2010      148     8.8  1583625            292.57       74.0
Using the isin() method we could make this more concise though:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
OUT:
                 rank                     genre                                        description           director                                             actors  year  runtime  rating    votes  revenue_millions  metascore
Title
Prometheus          2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...       Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012      124     7.0   485820            126.46       65.0
Interstellar       37    Adventure,Drama,Sci-Fi  A team of explorers travel through a wormhole ...  Christopher Nolan  Matthew McConaughey, Anne Hathaway, Jessica Ch...  2014      169     8.6  1047747            187.99       74.0
The Dark Knight    55        Action,Crime,Drama  When the menace known as the Joker wreaks havo...  Christopher Nolan  Christian Bale, Heath Ledger, Aaron Eckhart,Mi...  2008      152     9.0  1791916            533.32       82.0
The Prestige       65      Drama,Mystery,Sci-Fi  Two stage magicians engage in competitive one-...  Christopher Nolan  Christian Bale, Hugh Jackman, Scarlett Johanss...  2006      130     8.5   913152             53.08       66.0
Inception          81   Action,Adventure,Sci-Fi  A thief, who steals corporate secrets through ...  Christopher Nolan  Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...  2010      148     8.8  1583625            292.57       74.0
Let's say we want all movies that were released between 2005 and
2010, have a rating above 8.0, but made below the 25th percentile in
revenue.
Here's how we could do all of that:
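movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]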
OUT:
                     rank               genre                                        description                          director                                             actors  year  runtime  rating   votes  revenue_millions  metascore
Title
3 Idiots              431        Comedy,Drama  Two friends are searching for their long lost ...                   Rajkumar Hirani    Aamir Khan, Madhavan, Mona Singh, Sharman Joshi  2009      170     8.4  238789              6.52       67.0
The Lives of Others   477      Drama,Thriller  In 1984 East Berlin, an agent of the secret po...  Florian Henckel von Donnersmarck  Ulrich Mühe, Martina Gedeck,Sebastian Koch, Ul...  2006      137     8.5  278103             11.28       89.0
Incendies             714   Drama,Mystery,War  Twins journey to the Middle East to discover t...                  Denis Villeneuve  Lubna Azabal, Mélissa Désormeaux-Poulin, Maxim...  2010      131     8.2   92863              6.86       80.0
Taare Zameen Par      992  Drama,Family,Music  An eight-year-old boy is thought to be a lazy ...                        Aamir Khan  Darsheel Safary, Aamir Khan, Tanay Chheda, Sac...  2007      165     8.5  102697              1.20       42.0
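A multi-part filter like this is easier to debug when the threshold and the mask are built as named steps. The sketch below repeats the same three conditions on a made-up five-row frame with a hypothetical revenue column:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2004, 2006, 2008, 2009, 2012],
    'rating': [8.5, 8.4, 7.0, 8.2, 8.6],
    'revenue': [50.0, 2.0, 300.0, 100.0, 400.0],
}, index=['A', 'B', 'C', 'D', 'E'])

cutoff = df['revenue'].quantile(0.25)   # 25th percentile of revenue

# Comparisons combined with & need their own parentheses,
# because & binds more tightly than > or <.
mask = (
    df['year'].between(2005, 2010)      # inclusive on both ends
    & (df['rating'] > 8.0)
    & (df['revenue'] < cutoff)
)
selected = df[mask]
```

Only row B passes all three tests: it falls inside the year range, is rated above 8.0, and earned less than the cutoff.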
Applying functions
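It is possible to iterate over a DataFrame or Series as you would with a
list, but doing so, especially on large datasets, is very slow. An
efficient alternative is to apply() a function to the dataset. For
example, we could write a function that labels movies rated 8.0 or
higher as "good" and the rest as "bad":
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"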
Now we want to send the entire rating column through this function,
which is what apply() does:
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(2)
OUT:
                         rank                     genre                                        description      director                                             actors  year  runtime  rating   votes  revenue_millions  metascore  rating_category
Title
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi  A group of intergalactic criminals are forced ...    James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014      121     8.1  757074            333.13       76.0             good
Prometheus                  2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012      124     7.0  485820            126.46       65.0              bad
The .apply() method passes every value in the rating column through
the rating_function and then returns a new Series. This Series is then
assigned to a new column called rating_category .
You can also use anonymous functions. This lambda function
achieves the same result as rating_function:
movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')
movies_df.head(2)
OUT:
                         rank                     genre                                        description      director                                             actors  year  runtime  rating   votes  revenue_millions  metascore  rating_category
Title
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi  A group of intergalactic criminals are forced ...    James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014      121     8.1  757074            333.13       76.0             good
Prometheus                  2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...  Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012      124     7.0  485820            126.46       65.0              bad
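Overall, using apply() is usually faster than iterating manually over
rows, but it still calls the Python function once per value. When an
operation can be expressed with pandas' built-in vectorized operations
(arithmetic or comparisons on whole columns), those are faster still.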
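For a simple two-way label like this one, the comparison can be applied to the whole column at once. The sketch below shows the same good/bad labeling done with apply() and with NumPy's np.where() on a made-up Series; np.where() is a swapped-in alternative, not something the examples above use:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([8.1, 7.0, 7.3, 8.0], name='rating')

# apply(): calls the Python function once per value.
via_apply = ratings.apply(lambda x: 'good' if x >= 8.0 else 'bad')

# np.where(): evaluates the comparison on the whole array in one call.
via_where = pd.Series(np.where(ratings >= 8.0, 'good', 'bad'), index=ratings.index)
```

Both produce the same labels. The np.where() form tends to matter more as the number of rows grows.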
Brief Plotting
Another great thing about pandas is that it integrates with Matplotlib, so
you get the ability to plot directly off DataFrames and Series. To get
started we need to import Matplotlib ( pip install matplotlib ):
import matplotlib.pyplot as plt
Plotting Tip
For categorical variables utilize Bar Charts and Boxplots.
Let's plot the relationship between ratings and revenue. All we need to
do is call .plot() on movies_df with some info about how to construct the
plot:
movies_df.plot(kind='scatter', x='rating', y='revenue_millions');
RESULT:
What's with the semicolon? It's not a syntax error, just a way to hide
the <matplotlib.axes._subplots.AxesSubplot at 0x26613b5cc18> output when
plotting in Jupyter notebooks.
If we want to plot a simple Histogram based on a single column, we can
call plot on a column:
movies_df['rating'].plot(kind='hist', title='Rating');
RESULT:
Do you remember the .describe() example at the beginning of this
tutorial? Well, there's a graphical representation of the interquartile
range, called the Boxplot. Let's recall what describe() gives us on the
ratings column:
movies_df['rating'].describe()
OUT:
count 1000.000000
mean 6.723200
std 0.945429
min 1.900000
25% 6.200000
50% 6.800000
75% 7.400000
max 9.000000
movies_df['rating'].plot(kind="box");
RESULT:
Source: *Flowing Data*
By combining categorical and continuous data, we can create a Boxplot
of revenue that is grouped by the Rating Category we created above:
movies_df.boxplot(column='revenue_millions', by='rating_category');
RESULT:
That's the general idea of plotting with pandas. There are too many plot
types to mention, so definitely take a look at the plot() docs for more
information on what it can do.
Wrapping up
Exploring, cleaning, transforming, and visualizing data with pandas in
Python is an essential skill in data science. Cleaning and wrangling data
alone is often said to make up 80% of a Data Scientist's job. After a few
projects and some practice, you should be very comfortable with most of
the basics.
Resources
Applied Data Science with Python — Coursera
Covers an intro to Python, Visualization, Machine Learning,
Text Mining, and Social Network Analysis in Python. Also
provides many challenging quizzes and assignments to further
enhance your learning.