
Python Pandas Tutorial: A Complete Introduction for Beginners
Learn some of the most important pandas features for
exploring, cleaning, transforming, visualizing, and learning
from data.

You should already know:


 Python fundamentals – learn interactively on dataquest.io
The pandas package is the most important tool at the disposal of Data
Scientists and Analysts working in Python today. The powerful machine
learning and glamorous visualization tools may get all the attention, but
pandas is the backbone of most data projects.

[pandas] is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia
If you're thinking about data science as a career, then learning pandas should be one of the first things you do. In this post, we will go
over the essential bits of information about pandas, including how to
install it, its uses, and how it works with other common Python data
analysis packages such as matplotlib and scikit-learn.
Article Resources
 IPython notebook and data available on GitHub
Other articles in this series
 Applied Introduction to NumPy

What's Pandas for?


Pandas has so many uses that it might make sense to list the things it
can't do instead of what it can do.

This tool is essentially your data’s home. Through pandas, you get
acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your
computer. Pandas will extract the data from that CSV into a DataFrame
— a table, basically — then let you do things like:

 Calculate statistics and answer questions about the data, like


o What's the average, median, max, or min of each column?
o Does column A correlate with column B?
o What does the distribution of data in column C look like?
 Clean the data by doing things like removing missing values and filtering
rows or columns by some criteria
 Visualize the data with help from Matplotlib. Plot bars, lines, histograms,
bubbles, and more.
 Store the cleaned, transformed data back into a CSV, another file, or a database
Before you jump into the modeling or the complex visualizations, you need to have a good understanding of the nature of your dataset, and pandas is the best avenue through which to do that.
How does pandas fit into the data science toolkit?
Not only is the pandas library a central component of the data science toolkit, but it is also used in conjunction with other libraries in that collection.

Pandas is built on top of the NumPy package, meaning a lot of the


structure of NumPy is used or replicated in Pandas. Data in pandas is
often used to feed statistical analysis in SciPy, plotting functions
from Matplotlib, and machine learning algorithms in Scikit-learn.
Jupyter Notebooks offer a good environment for using pandas to do data
exploration and modeling, but pandas can also be used in text editors
just as easily.

Jupyter Notebooks give us the ability to execute code in a particular cell


as opposed to running the entire file. This saves a lot of time when
working with large datasets and complex transformations. Notebooks
also provide an easy way to visualize pandas’ DataFrames and plots. As
a matter of fact, this article was created entirely in a Jupyter Notebook.
When should you start using pandas?
If you do not have any experience coding in Python, then you should stay away from learning pandas until you do. You don't have to be at the level of a software engineer, but you should be adept at the basics, such as lists, tuples, dictionaries, functions, and iterations. I'd also recommend familiarizing yourself with NumPy due to the similarities mentioned above.
If you're looking for a good place to learn Python, Python for
Everybody on Coursera is great (and Free).


Moreover, for those of you looking to do a data science bootcamp or


some other accelerated data science education program, it's highly
recommended you start learning pandas on your own before you start
the program.

Even though accelerated programs teach you pandas, better skills


beforehand means you'll be able to maximize time for learning and
mastering the more complicated material.
Pandas First Steps
Install and import
Pandas is an easy package to install. Open up your terminal program
(for Mac users) or command line (for PC users) and install it using either
of the following commands:

conda install pandas

OR

pip install pandas

Alternatively, if you're currently viewing this article in a Jupyter notebook


you can run this cell:

!pip install pandas

The ! at the beginning runs cells as if they were in a terminal.


We usually import pandas with a shorter name since it's used so much:

import pandas as pd

Now to the basic components of pandas.

Core components of pandas: Series and DataFrames
The primary two components of pandas are the Series and DataFrame .
A Series is essentially a column, and a DataFrame is a multi-dimensional
table made up of a collection of Series.
DataFrames and Series are quite similar in that many operations that
you can do with one you can do with the other, such as filling in null
values and calculating the mean.

You'll see how these components work when we start working with data
below.
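Before that, here's a quick standalone sketch (with made-up numbers) of what a Series looks like on its own:

apples = pd.Series([3, 2, 0, 1], name='apples')  # a single labeled column of data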

Creating DataFrames from scratch


Creating DataFrames right in Python is good to know and quite useful
when testing new methods and functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great
option is to just use a simple dict .
Let's say we have a fruit stand that sells apples and oranges. We want
to have a column for each fruit and a row for each customer purchase.
To organize this as a dictionary for pandas we could do something like:

data = {

'apples': [3, 2, 0, 1],

'oranges': [0, 3, 7, 2]

}
And then pass it to the pandas DataFrame constructor:

purchases = pd.DataFrame(data)

purchases

OUT:

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2

How did that work?


Each (key, value) item in data corresponds to a column in the resulting
DataFrame.
The Index of this DataFrame was given to us on creation as the
numbers 0-3, but we could also create our own when we initialize the
DataFrame.
Let's have customer names as our index:

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

OUT:

        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2

So now we could locate a customer's order by using their name:

purchases.loc['June']

OUT:

apples 3

oranges 0

Name: June, dtype: int64

There's more on locating and extracting data from the DataFrame later,
but now you should be able to create a DataFrame with any random
data to learn on.

Let's move on to some quick methods for creating DataFrames from


various other sources.


How to read in data


It’s quite simple to load data from various file formats into a DataFrame.
In the following examples we'll keep using our apples and oranges data,
but this time it's coming from various files.

Reading data from CSVs


With CSV files all you need is a single line to load in the data:

df = pd.read_csv('purchases.csv')

df

OUT:

Unnamed: 0 apples oranges

0 June 3 0

1 Robert 2 3

2 Lily 0 7

3 David 1 2

CSVs don't have indexes like our DataFrames, so all we need to do is designate the index_col when reading:

df = pd.read_csv('purchases.csv', index_col=0)

df

OUT:

        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2

Here we're setting the index to be column zero.

You'll find that most CSVs won't ever have an index column and so
usually you don't have to worry about this step.

Reading data from JSON


If you have a JSON file — which is essentially a stored Python dict —
pandas can read this just as easily:

df = pd.read_json('purchases.json')

df

OUT:

        apples  oranges
David        1        2
June         3        0
Lily         0        7
Robert       2        3

Notice this time our index came with us correctly since using JSON allowed indexes to work through nesting. Feel free to open purchases.json in a notepad so you can see how it works.
Pandas will try to figure out how to create a DataFrame by analyzing the structure of your JSON, and sometimes it doesn't get it right. Often you'll need to set the orient keyword argument depending on the structure, so check out the read_json docs about that argument to see which orientation you're using.
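For instance, if your JSON were a list of row records rather than a column-oriented dict, you'd tell read_json so with orient='records' (reusing our earlier file name as a stand-in):

df = pd.read_json('purchases.json', orient='records')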

Reading data from a SQL database


If you’re working with data from a SQL database you need to first
establish a connection using an appropriate Python library, then pass a
query to pandas. Here we'll use SQLite to demonstrate.

First, we need a library that can make the connection. For SQLite there's nothing extra to install: the sqlite3 module ships with Python's standard library.

sqlite3 is used to create a connection to a database which we can then


use to generate a DataFrame through a SELECT query.
So first we'll make a connection to a SQLite database file:

import sqlite3

con = sqlite3.connect("database.db")

SQL Tip
If you have data in PostgreSQL, MySQL, or some other SQL
server, you'll need to obtain the right Python library to make
a connection. For example, psycopg2 is a commonly used
library for making connections to PostgreSQL. Furthermore, you
would make a connection to a database URI instead of a file
like we did here with SQLite.
For a great course on SQL check out The Complete SQL
Bootcamp on Udemy
In this SQLite database we have a table called purchases, and our index
is in a column called "index".
By passing a SELECT query and our con , we can read from
the purchases table:

df = pd.read_sql_query("SELECT * FROM purchases", con)

df

OUT:

index apples oranges

0 June 3 0

1 Robert 2 3

2 Lily 0 7

3 David 1 2

Just like with CSVs, we could pass index_col='index' , but we can also set
an index after-the-fact:

df = df.set_index('index')

df

OUT:

        apples  oranges
index
June         3        0
Robert       2        3
Lily         0        7
David        1        2

In fact, we could use set_index() on any DataFrame using any column at any time. Indexing Series and DataFrames is a very common task, and the different ways of doing it are worth remembering.

Converting back to a CSV, JSON, or SQL


So after extensive work on cleaning your data, you’re now ready to save
it as a file of your choice. Similar to the ways we read in data, pandas
provides intuitive commands to save it:

df.to_csv('new_purchases.csv')

df.to_json('new_purchases.json')

df.to_sql('new_purchases', con)

When we save JSON and CSV files, all we have to input into those
functions is our desired filename with the appropriate file extension.
With SQL, we’re not creating a new file but instead inserting a new table
into the database using our con variable from before.
Let's move on to importing some real-world data and detailing a few of
the operations you'll be using a lot.
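One detail worth knowing about to_sql: it raises a ValueError if the table already exists. The if_exists parameter controls that behavior:

df.to_sql('new_purchases', con, if_exists='replace')  # or 'append' to add rows to an existing table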

Most important DataFrame operations
DataFrames possess hundreds of methods and other operations that
are crucial to any analysis. As a beginner, you should know the
operations that perform simple transformations of your data and those
that provide fundamental statistical analysis.

Let's load in the IMDB movies dataset to begin:

movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

We're loading this dataset from a CSV and designating the movie titles
to be our index.

Viewing your data


The first thing to do when opening a new dataset is print out a few rows
to keep as a visual reference. We accomplish this with .head() :

movies_df.head()

OUT:

                         Rank                     Genre              Director  Year  Runtime (Minutes)  Rating   Votes  Revenue (Millions)  Metascore
Title
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi            James Gunn  2014                121     8.1  757074              333.13       76.0
Prometheus                  2  Adventure,Mystery,Sci-Fi          Ridley Scott  2012                124     7.0  485820              126.46       65.0
Split                       3           Horror,Thriller    M. Night Shyamalan  2016                117     7.3  157606              138.12       62.0
Sing                        4   Animation,Comedy,Family  Christophe Lourdelet  2016                108     7.2   60545              270.32       59.0
Suicide Squad               5  Action,Adventure,Fantasy            David Ayer  2016                123     6.2  393727              325.02       40.0

(The Description and Actors columns are truncated here for readability.)

.head() outputs the first five rows of your DataFrame by default, but we could also pass a number: movies_df.head(10) would output the top ten rows, for example.
To see the last five rows use .tail(). tail() also accepts a number, and in this case we're printing the bottom two rows:
movies_df.tail(2)

OUT:

              Rank                  Genre          Director  Year  Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore
Title
Search Party   999       Adventure,Comedy    Scot Armstrong  2014                 93     5.6   4881                 NaN       22.0
Nine Lives    1000  Comedy,Family,Fantasy  Barry Sonnenfeld  2016                 87     5.3  12435               19.64       11.0

(The Description and Actors columns are truncated here for readability.)

Typically when we load in a dataset, we like to view the first five or so


rows to see what's under the hood. Here we can see the names of each
column, the index, and examples of values in each row.

You'll notice that the index in our DataFrame is the Title column, which
you can tell by how the word Title is slightly lower than the rest of the
columns.

Getting info about your data


.info() should be one of the very first commands you run after loading
your data:
movies_df.info()

OUT:

<class 'pandas.core.frame.DataFrame'>

Index: 1000 entries, Guardians of the Galaxy to Nine Lives

Data columns (total 11 columns):

Rank 1000 non-null int64

Genre 1000 non-null object

Description 1000 non-null object

Director 1000 non-null object

Actors 1000 non-null object

Year 1000 non-null int64


Runtime (Minutes) 1000 non-null int64

Rating 1000 non-null float64

Votes 1000 non-null int64

Revenue (Millions) 872 non-null float64

Metascore 936 non-null float64

dtypes: float64(3), int64(4), object(4)

memory usage: 93.8+ KB

.info() provides the essential details about your dataset, such as the
number of rows and columns, the number of non-null values, what type
of data is in each column, and how much memory your DataFrame is
using.
Notice in our movies dataset we have some obvious missing values in
the Revenue and Metascore columns. We'll look at how to handle those in
a bit.
Seeing the datatype quickly is actually quite useful. Imagine you just imported some JSON and the integers were recorded as strings. You go to do some arithmetic and find an "unsupported operand" Exception because you can't do math with strings. Calling .info() will quickly point out that the column you thought was all integers actually contains string objects.
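If you do hit that situation, the usual fix is casting the column with astype(); 'votes' here is just a hypothetical example of an affected column:

movies_df['votes'] = movies_df['votes'].astype(int)  # convert strings like '757074' into integers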
Another fast and useful attribute is .shape , which outputs just a tuple of
(rows, columns):

movies_df.shape

OUT:

(1000, 11)

Note that .shape has no parentheses and is a simple tuple of format


(rows, columns). So we have 1000 rows and 11 columns in our movies
DataFrame.
You'll be going back to .shape a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.

Handling duplicates
This dataset does not have duplicate rows, but it is always important to
verify you aren't aggregating duplicate rows.

To demonstrate, let's simply double up our movies DataFrame by appending it to itself:

temp_df = movies_df.append(movies_df)

temp_df.shape

OUT:

(2000, 11)
Using append() will return a copy without affecting the original
DataFrame. We are capturing this copy in temp so we aren't working
with the real data.
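One version note, in case you're on pandas 2.0 or newer: DataFrame.append() was deprecated and then removed, so there you'd double the DataFrame with pd.concat instead:

temp_df = pd.concat([movies_df, movies_df])  # stacks the DataFrame on itself, same result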
Notice that calling .shape quickly proves our DataFrame rows have doubled.
Now we can try dropping duplicates:

temp_df = temp_df.drop_duplicates()

temp_df.shape

OUT:

(1000, 11)

Just like append() , the drop_duplicates() method will also return a copy of
your DataFrame, but this time with duplicates removed.
Calling .shape confirms we're back to the 1000 rows of our original
dataset.
It's a little verbose to keep assigning DataFrames to the same variable
like in this example. For this reason, pandas has the inplace keyword
argument on many of its methods. Using inplace=True will modify the
DataFrame object in place:

temp_df.drop_duplicates(inplace=True)

Now our temp_df will have the transformed data automatically.


Another important argument for drop_duplicates() is keep , which has
three possible options:
 first: (default) Drop duplicates except for the first occurrence.
 last: Drop duplicates except for the last occurrence.
 False: Drop all duplicates.
Since we didn't define the keep argument in the previous example, it defaulted to first. This means that if two rows are the same, pandas will drop the second row and keep the first row. Using last has the opposite effect: the first row is dropped.
keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will be dropped. Watch what happens to temp_df:

temp_df = movies_df.append(movies_df) # make a new copy

temp_df.drop_duplicates(inplace=True, keep=False)

temp_df.shape

OUT:

(0, 11)

Since all rows were duplicates, keep=False dropped them all resulting in
zero rows being left over. If you're wondering why you would want to do
this, one reason is that it allows you to locate all duplicates in your
dataset. When conditional selections are shown below you'll see how to
do that.
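As a sketch of that idea, the duplicated() method gives you a Boolean mask you can filter with, and keep=False marks every occurrence:

all_dupes = movies_df[movies_df.duplicated(keep=False)]  # every row that appears more than once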

Column cleanup
Many times datasets will have verbose column names with symbols,
upper and lowercase words, spaces, and typos. To make selecting data
by column name easier we can spend a little time cleaning up their
names.

Here's how to print the column names of our dataset:

movies_df.columns

OUT:

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',


'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',

'Metascore'],

dtype='object')

Not only does .columns come in handy if you want to rename columns by allowing for simple copy and paste, it's also useful if you need to understand why you are receiving a KeyError when selecting data by column.
We can use the .rename() method to rename certain or all columns via
a dict . We don't want parentheses, so let's rename those:

movies_df.rename(columns={

'Runtime (Minutes)': 'Runtime',

'Revenue (Millions)': 'Revenue_millions'

}, inplace=True)

movies_df.columns

OUT:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',

'Runtime',

'Rating', 'Votes', 'Revenue_millions', 'Metascore'],

dtype='object')

Excellent. But what if we want to lowercase all names? Instead of


using .rename() we could also set a list of names to the columns like so:

movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year',
                     'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']

movies_df.columns

movies_df.columns

OUT:

Index(['rank', 'genre', 'description', 'director', 'actors', 'year',

'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],

dtype='object')

But that's too much work. Instead of just renaming each column
manually we can do a list comprehension:

movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns

OUT:

Index(['rank', 'genre', 'description', 'director', 'actors', 'year',

'runtime',

'rating', 'votes', 'revenue_millions', 'metascore'],

dtype='object')

list (and dict ) comprehensions come in handy a lot when working with
pandas and data in general.
It's a good idea to lowercase, remove special characters, and replace
spaces with underscores if you'll be working with a dataset for some
time.
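Putting those three cleanup steps together, here's a sketch that normalizes all column names in one pass using the Index's string methods (the regex patterns are just one reasonable choice):

movies_df.columns = (
    movies_df.columns
    .str.lower()                              # lowercase everything
    .str.replace(r'[^\w\s]', '', regex=True)  # strip special characters
    .str.replace(r'\s+', '_', regex=True)     # spaces to underscores
)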
How to work with missing values
When exploring data, you’ll most likely encounter missing or null values,
which are essentially placeholders for non-existent values. Most
commonly you'll see Python's None or NumPy's np.nan , each of which
are handled differently in some situations.
There are two options in dealing with nulls:

1. Get rid of rows or columns with nulls


2. Replace nulls with non-null values, a technique known as imputation
Let's calculate the total number of nulls in each column of our dataset. The first step is to check which cells in our DataFrame are null:

movies_df.isnull()

OUT:

                          rank  genre  description  director  actors   year  runtime  rating  votes  revenue_millions  metascore
Title
Guardians of the Galaxy  False  False        False     False   False  False    False   False  False             False      False
Prometheus               False  False        False     False   False  False    False   False  False             False      False
Split                    False  False        False     False   False  False    False   False  False             False      False
Sing                     False  False        False     False   False  False    False   False  False             False      False
Suicide Squad            False  False        False     False   False  False    False   False  False             False      False
Notice isnull() returns a DataFrame where each cell is either True or
False depending on that cell's null status.
To count the number of nulls in each column we use an aggregate
function for summing:

movies_df.isnull().sum()

OUT:

rank 0
genre 0

description 0

director 0

actors 0

year 0

runtime 0

rating 0

votes 0

revenue_millions 128

metascore 64
dtype: int64

.isnull() just by itself isn't very useful; it's usually used in conjunction with other methods, like sum().
We can see now that our data has 128 missing values
for revenue_millions and 64 missing values for metascore .

Removing null values


Data Scientists and Analysts regularly face the dilemma of dropping or imputing null values, a decision that requires intimate knowledge of your data and its context. Overall, removing null data is only suggested if you have a small amount of missing data.

Removing nulls is pretty simple:

movies_df.dropna()

This operation will delete any row with at least a single null value, but it
will return a new DataFrame without altering the original one. You could
specify inplace=True in this method as well.
So in the case of our dataset, this operation would remove 128 rows
where revenue_millions is null and 64 rows where metascore is null. This
obviously seems like a waste since there's perfectly good data in the
other columns of those dropped rows. That's why we'll look at imputation
next.
Other than just dropping rows, you can also drop columns with null
values by setting axis=1 :

movies_df.dropna(axis=1)

In our dataset, this operation would drop the revenue_millions and metascore columns.
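A useful middle ground is the subset parameter, which considers nulls only in the columns you name; assuming we only care about missing revenue:

movies_df.dropna(subset=['revenue_millions'])  # drops rows missing revenue; rows missing only metascore survive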

Intuition
What's with this axis=1 parameter?
It's not immediately obvious where axis comes from and why you
need it to be 1 for it to affect columns. To see why, just
look at the .shape output:
movies_df.shape

Out: (1000, 11)

As we learned above, this is a tuple that represents the shape


of the DataFrame, i.e. 1000 rows and 11 columns. Note that
the rows are at index zero of this tuple and columns are
at index one of this tuple. This is why axis=1 affects columns.
This comes from NumPy, and is a great example of why learning
NumPy is worth your time.

Imputation
Imputation is a conventional feature engineering technique used to keep
valuable data that have null values.

There may be instances where dropping every row with a null value
removes too big a chunk from your dataset, so instead we can impute
that null with another value, usually the mean or the median of that
column.
Let's look at imputing the missing values in the revenue_millions column.
First we'll extract that column into its own variable:

revenue = movies_df['revenue_millions']

Using square brackets is the general way we select columns in a


DataFrame.

If you remember back to when we created DataFrames from scratch, the


keys of the dict ended up as column names. Now when we select
columns of a DataFrame, we use brackets just like if we were accessing
a Python dictionary.
revenue now contains a Series:
revenue.head()

OUT:

Title

Guardians of the Galaxy 333.13

Prometheus 126.46

Split 138.12

Sing 270.32

Suicide Squad 325.02

Name: revenue_millions, dtype: float64

Slightly different formatting than a DataFrame, but we still have


our Title index.
We'll impute the missing values of revenue using the mean. Here's the
mean value:

revenue_mean = revenue.mean()
revenue_mean

OUT:

82.95637614678897

With the mean, let's fill the nulls using fillna() :

revenue.fillna(revenue_mean, inplace=True)

We have now replaced all nulls in revenue with the mean of the column.
Notice that by using inplace=True we have actually affected the
original movies_df :

movies_df.isnull().sum()

OUT:

rank 0

genre 0

description 0

director 0

actors 0
year 0

runtime 0

rating 0

votes 0

revenue_millions 0

metascore 64

dtype: int64

Imputing an entire column with the same value like this is a basic
example. It would be a better idea to try a more granular imputation by
Genre or Director.

For example, you would find the mean of the revenue generated in each
genre individually and impute the nulls in each genre with that genre's
mean.
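Here's a sketch of that per-genre idea: groupby() plus transform() lines each genre's mean up with the original rows so fillna() can use it:

movies_df['revenue_millions'] = movies_df['revenue_millions'].fillna(
    movies_df.groupby('genre')['revenue_millions'].transform('mean')  # each null gets its own genre's mean
)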

Let's now look at more ways to examine and understand the dataset.

Understanding your variables


Using describe() on an entire DataFrame we can get a summary of the
distribution of continuous variables:

movies_df.describe()

OUT:

              rank         year      runtime       rating         votes  revenue_millions   metascore
count  1000.000000  1000.000000  1000.000000  1000.000000  1.000000e+03       1000.000000  936.000000
mean    500.500000  2012.783000   113.172000     6.723200  1.698083e+05         82.956376   58.985043
std     288.819436     3.205962    18.810908     0.945429  1.887626e+05         96.412043   17.194757
min       1.000000  2006.000000    66.000000     1.900000  6.100000e+01          0.000000   11.000000
25%     250.750000  2010.000000   100.000000     6.200000  3.630900e+04         17.442500   47.000000
50%     500.500000  2014.000000   111.000000     6.800000  1.107990e+05         60.375000   59.500000
75%     750.250000  2016.000000   123.000000     7.400000  2.399098e+05         99.177500   72.000000
max    1000.000000  2016.000000   191.000000     9.000000  1.791916e+06        936.630000  100.000000

Understanding which numbers are continuous also comes in handy


when thinking about the type of plot to use to represent your data
visually.
.describe() can also be used on a categorical variable to get the count of
rows, unique count of categories, top category, and freq of top category:

movies_df['genre'].describe()

OUT:

count 1000

unique 207

top Action,Adventure,Sci-Fi

freq 50

Name: genre, dtype: object

This tells us that the genre column has 207 unique values; the top value is Action,Adventure,Sci-Fi, which shows up 50 times (freq).

.value_counts() can tell us the frequency of all values in a column:

movies_df['genre'].value_counts().head(10)

OUT:

Action,Adventure,Sci-Fi 50
Drama 48

Comedy,Drama,Romance 35

Comedy 32

Drama,Romance 31

Action,Adventure,Fantasy 27

Comedy,Drama 27

Animation,Adventure,Comedy 27

Comedy,Romance 26

Crime,Drama,Thriller 24

Name: genre, dtype: int64
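If proportions are more useful than raw counts, value_counts() also takes a normalize parameter:

movies_df['genre'].value_counts(normalize=True).head(3)  # relative frequencies instead of counts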


Relationships between continuous variables
By using the correlation method .corr() we can generate the
relationship between each continuous variable:

movies_df.corr()

OUT:

                      rank      year   runtime    rating     votes  revenue_millions  metascore
rank              1.000000 -0.261605 -0.221739 -0.219555 -0.283876         -0.252996  -0.191869
year             -0.261605  1.000000 -0.164905 -0.211219 -0.411904         -0.117562  -0.079305
runtime          -0.221739 -0.164905  1.000000  0.392214  0.407062          0.247834   0.211978
rating           -0.219555 -0.211219  0.392214  1.000000  0.511537          0.189527   0.631897
votes            -0.283876 -0.411904  0.407062  0.511537  1.000000          0.607941   0.325684
revenue_millions -0.252996 -0.117562  0.247834  0.189527  0.607941          1.000000   0.133328
metascore        -0.191869 -0.079305  0.211978  0.631897  0.325684          0.133328   1.000000

Correlation tables are a numerical representation of the bivariate


relationships in the dataset.
Positive numbers indicate a positive correlation — one goes up the
other goes up — and negative numbers represent an inverse correlation
— one goes up the other goes down. 1.0 indicates a perfect correlation.

So looking in the first row, first column we see rank has a perfect
correlation with itself, which is obvious. On the other hand, the
correlation between votes and revenue_millions is 0.6. A little more
interesting.
Examining bivariate relationships comes in handy when you have an
outcome or dependent variable in mind and would like to see the
features most correlated to the increase or decrease of the outcome.
You can visually represent bivariate relationships with scatterplots (seen
below in the plotting section).

For a deeper look into data summarizations check out Essential


Statistics for Data Science .
Let's now look more at manipulating DataFrames.

DataFrame slicing, selecting, extracting


Up until now we've focused on some basic summaries of our data.
We've learned about simple column extraction using single brackets,
and we imputed null values in a column using fillna() . Below are the
other methods of slicing, selecting, and extracting you'll need to use
constantly.
It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to be sure to know which type you are working with or else you will receive attribute errors.

Let's look at working with columns first.

By column
You already saw how to extract a column using square brackets like
this:
genre_col = movies_df['genre']

type(genre_col)

OUT:

pandas.core.series.Series

This will return a Series. To extract a column as a DataFrame, you need


to pass a list of column names. In our case that's just a single column:

genre_col = movies_df[['genre']]

type(genre_col)

pandas.core.frame.DataFrame

Since it's just a list, adding another column name is easy:

subset = movies_df[['genre', 'rating']]

subset.head()

OUT:

                                            genre  rating
Title
Guardians of the Galaxy   Action,Adventure,Sci-Fi     8.1
Prometheus               Adventure,Mystery,Sci-Fi     7.0
Split                             Horror,Thriller     7.3
Sing                      Animation,Comedy,Family     7.2
Suicide Squad            Action,Adventure,Fantasy     6.2

Now we'll look at getting data by rows.

By rows
For rows, we have two options:

 .loc - locates by name
 .iloc - locates by numerical index
Remember that we are still indexed by movie Title, so to use .loc we
give it the Title of a movie:

prom = movies_df.loc["Prometheus"]

prom

OUT:

rank 2

genre Adventure,Mystery,Sci-Fi

description Following clues to the origin of mankind, a te...

director Ridley Scott


actors Noomi Rapace, Logan Marshall-Green, Michael Fa...

year 2012

runtime 124

rating 7

votes 485820

revenue_millions 126.46

metascore 65

Name: Prometheus, dtype: object

On the other hand, with iloc we give it the numerical index of


Prometheus:

prom = movies_df.iloc[1]

.loc and .iloc can be thought of as similar to Python list slicing. To show this even further, let's select multiple rows.
How would you do it with a list? In Python, just slice with brackets like example_list[1:4]. It works the same way in pandas:

movie_subset = movies_df.loc['Prometheus':'Sing']

movie_subset = movies_df.iloc[1:4]

movie_subset

OUT:

            rank                     genre              director  year  runtime  rating   votes  revenue_millions  metascore
Title
Prometheus     2  Adventure,Mystery,Sci-Fi          Ridley Scott  2012      124     7.0  485820            126.46       65.0
Split          3           Horror,Thriller    M. Night Shyamalan  2016      117     7.3  157606            138.12       62.0
Sing           4   Animation,Comedy,Family  Christophe Lourdelet  2016      108     7.2   60545            270.32       59.0

(The description and actors columns are truncated here for readability.)

One important distinction between using .loc and .iloc to select multiple rows is that .loc includes the movie Sing in the result, but when using .iloc we're getting rows 1:4 and the movie at index 4 (Suicide Squad) is not included.
Slicing with .iloc follows the same rules as slicing with lists: the object at the end index is not included.
Conditional selections
We’ve gone over how to select columns and rows, but what if we want to
make a conditional selection?

For example, what if we want to filter our movies DataFrame to show


only films directed by Ridley Scott or films with a rating greater than or
equal to 8.0?

To do that, we take a column from the DataFrame and apply a Boolean


condition to it. Here's an example of a Boolean condition:

condition = (movies_df['director'] == "Ridley Scott")

condition.head()

OUT:
Title

Guardians of the Galaxy False

Prometheus True

Split False

Sing False

Suicide Squad False

Name: director, dtype: bool

Similar to isnull() , this returns a Series of True and False values: True
for films directed by Ridley Scott and False for ones not directed by him.
We want to filter out all movies not directed by Ridley Scott, in other
words, we don’t want the False films. To return the rows where that
condition is True we have to pass this operation into the DataFrame:

movies_df[movies_df['director'] == "Ridley Scott"]

OUT:
                        rank                     genre      director  year  runtime  rating   votes  revenue_millions  metascore rating_category
Title
Prometheus                 2  Adventure,Mystery,Sci-Fi  Ridley Scott  2012      124     7.0  485820            126.46       65.0             bad
The Martian              103    Adventure,Drama,Sci-Fi  Ridley Scott  2015      144     8.0  556097            228.43       80.0            good
Robin Hood               388    Action,Adventure,Drama  Ridley Scott  2010      140     6.7  221117            105.22       53.0             bad
American Gangster        471     Biography,Crime,Drama  Ridley Scott  2007      157     7.8  337835            130.13       76.0             bad
Exodus: Gods and Kings   517    Action,Adventure,Drama  Ridley Scott  2014      150     6.0  137299             65.01       52.0             bad

(The description and actors columns are truncated here for readability. The rating_category column appears because the source notebook created it in a later section of this article.)

You can get used to looking at these conditionals by reading it like:

Select movies_df where movies_df director equals Ridley Scott.


Let's look at conditional selections using numerical values by filtering
the DataFrame by ratings:

movies_df[movies_df['rating'] >= 8.6].head(3)


OUT:

                 rank                    genre           director  year  runtime  rating    votes  revenue_millions  metascore
Title
Interstellar       37   Adventure,Drama,Sci-Fi  Christopher Nolan  2014      169     8.6  1047747            187.99       74.0
The Dark Knight    55       Action,Crime,Drama  Christopher Nolan  2008      152     9.0  1791916            533.32       82.0
Inception          81  Action,Adventure,Sci-Fi  Christopher Nolan  2010      148     8.8  1583625            292.57       74.0

(The description and actors columns are truncated here for readability.)

We can make some richer conditionals by using logical operators | for "or" and & for "and".
Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

OUT:

                 rank                     genre           director  year  runtime  rating    votes  revenue_millions  metascore
Title
Prometheus          2  Adventure,Mystery,Sci-Fi       Ridley Scott  2012      124     7.0   485820            126.46       65.0
Interstellar       37    Adventure,Drama,Sci-Fi  Christopher Nolan  2014      169     8.6  1047747            187.99       74.0
The Dark Knight    55        Action,Crime,Drama  Christopher Nolan  2008      152     9.0  1791916            533.32       82.0
The Prestige       65      Drama,Mystery,Sci-Fi  Christopher Nolan  2006      130     8.5   913152             53.08       66.0
Inception          81   Action,Adventure,Sci-Fi  Christopher Nolan  2010      148     8.8  1583625            292.57       74.0

(The description and actors columns are truncated here for readability.)

We need to make sure to group evaluations with parentheses so Python


knows how to evaluate the conditional.

Using the isin() method we could make this more concise though:

movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

OUT:

                 rank                     genre           director  year  runtime  rating    votes  revenue_millions  metascore
Title
Prometheus          2  Adventure,Mystery,Sci-Fi       Ridley Scott  2012      124     7.0   485820            126.46       65.0
Interstellar       37    Adventure,Drama,Sci-Fi  Christopher Nolan  2014      169     8.6  1047747            187.99       74.0
The Dark Knight    55        Action,Crime,Drama  Christopher Nolan  2008      152     9.0  1791916            533.32       82.0
The Prestige       65      Drama,Mystery,Sci-Fi  Christopher Nolan  2006      130     8.5   913152             53.08       66.0
Inception          81   Action,Adventure,Sci-Fi  Christopher Nolan  2010      148     8.8  1583625            292.57       74.0

(The description and actors columns are truncated here for readability.)

Let's say we want all movies that were released between 2005 and
2010, have a rating above 8.0, but made below the 25th percentile in
revenue.
Here's how we could do all of that:

movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]

OUT:

                     rank               genre                          director  year  runtime  rating   votes  revenue_millions  metascore
Title
3 Idiots              431        Comedy,Drama                   Rajkumar Hirani  2009      170     8.4  238789              6.52       67.0
The Lives of Others   477      Drama,Thriller  Florian Henckel von Donnersmarck  2006      137     8.5  278103             11.28       89.0
Incendies             714   Drama,Mystery,War                  Denis Villeneuve  2010      131     8.2   92863              6.86       80.0
Taare Zameen Par      992  Drama,Family,Music                        Aamir Khan  2007      165     8.5  102697              1.20       42.0

(The description and actors columns are truncated here for readability.)

If you recall, when we used .describe() the 25th percentile for revenue was about 17.4, and we can access this value directly by using the quantile() method with a float of 0.25.
So here we have only four movies that match that criteria.
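You can check that value directly:

movies_df['revenue_millions'].quantile(0.25)  # about 17.44, matching the 25% row from describe()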

Applying functions
It is possible to iterate over a DataFrame or Series as you would with a
list, but doing so — especially on large datasets — is very slow.

An efficient alternative is to apply() a function to the dataset. For example, we could use a function to convert movies with an 8.0 or greater to a string value of "good" and the rest to "bad" and use these transformed values to create a new column.
First we would create a function that, when given a rating, determines if
it's good or bad:

def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

Now we want to send the entire rating column through this function,
which is what apply() does:

movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

movies_df.head(2)

OUT:

                         rank                     genre      director  year  runtime  rating   votes  revenue_millions  metascore rating_category
Title
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi    James Gunn  2014      121     8.1  757074            333.13       76.0            good
Prometheus                  2  Adventure,Mystery,Sci-Fi  Ridley Scott  2012      124     7.0  485820            126.46       65.0             bad

(The description and actors columns are truncated here for readability.)

The .apply() method passes every value in the rating column through
the rating_function and then returns a new Series. This Series is then
assigned to a new column called rating_category .
You can also use anonymous functions as well. This lambda function
achieves the same result as rating_function :

movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if

x >= 8.0 else 'bad')

movies_df.head(2)

OUT:

                         rank                     genre      director  year  runtime  rating   votes  revenue_millions  metascore rating_category
Title
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi    James Gunn  2014      121     8.1  757074            333.13       76.0            good
Prometheus                  2  Adventure,Mystery,Sci-Fi  Ridley Scott  2012      124     7.0  485820            126.46       65.0             bad

(The description and actors columns are truncated here for readability.)

Overall, using apply() will be much faster than iterating manually over
rows because pandas is utilizing vectorization.

Vectorization: a style of computer programming where


operations are applied to whole arrays instead of individual
elements —Wikipedia
A good example of high usage of apply() is during natural language
processing (NLP) work. You'll need to apply all sorts of text cleaning
functions to strings to prepare for machine learning.
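As a small, hedged illustration of that kind of work, here's a simple cleaning pass over the description column (the cleaning steps are just examples, not a prescribed NLP pipeline):

import string

movies_df['description_clean'] = movies_df['description'].apply(
    lambda s: s.lower().translate(str.maketrans('', '', string.punctuation))  # lowercase, strip punctuation
)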

Brief Plotting
Another great thing about pandas is that it integrates with Matplotlib, so
you get the ability to plot directly off DataFrames and Series. To get
started we need to import Matplotlib ( pip install matplotlib ):
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)})  # set font and plot size to be larger

Now we can begin. There won't be a lot of coverage on plotting, but it should be enough to explore your data easily.

Plotting Tip
For categorical variables utilize Bar Charts and Boxplots.

For continuous variables utilize Histograms, Scatterplots, Line graphs, and Boxplots.

Let's plot the relationship between ratings and revenue. All we need to
do is call .plot() on movies_df with some info about how to construct the
plot:

movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

RESULT:
What's with the semicolon? It's not a syntax error, just a way to hide
the <matplotlib.axes._subplots.AxesSubplot at 0x26613b5cc18> output when
plotting in Jupyter notebooks.
If we want to plot a simple Histogram based on a single column, we can
call plot on a column:

movies_df['rating'].plot(kind='hist', title='Rating');

RESULT:
Do you remember the .describe() example at the beginning of this
tutorial? Well, there's a graphical representation of the interquartile
range, called the Boxplot. Let's recall what describe() gives us on the
ratings column:

movies_df['rating'].describe()

OUT:

count 1000.000000

mean 6.723200

std 0.945429
min 1.900000

25% 6.200000

50% 6.800000

75% 7.400000

max 9.000000

Name: rating, dtype: float64

Using a Boxplot we can visualize this data:

movies_df['rating'].plot(kind="box");

RESULT:
Source: Flowing Data
By combining categorical and continuous data, we can create a Boxplot
of revenue that is grouped by the Rating Category we created above:
movies_df.boxplot(column='revenue_millions', by='rating_category');

RESULT:

That's the general idea of plotting with pandas. There are too many plots to mention, so definitely take a look at the plot() docs for more information on what it can do.

Wrapping up
Exploring, cleaning, transforming, and visualizing data with pandas in Python is an essential skill in data science. Just cleaning and wrangling data is 80% of your job as a Data Scientist. After a few projects and some practice, you should be very comfortable with most of the basics.

To keep improving, view the extensive tutorials offered by the official


pandas docs, follow along with a few Kaggle kernels, and keep working
on your own projects!

Resources
Applied Data Science with Python — Coursera
Covers an intro to Python, Visualization, Machine Learning,
Text Mining, and Social Network Analysis in Python. Also
provides many challenging quizzes and assignments to further
enhance your learning.

Complete SQL Bootcamp — Udemy


An excellent course for learning SQL. The instructor explains
everything from beginner to advanced SQL queries and
techniques, and provides many exercises to help you learn.
