
Python Pandas Tutorial: A Complete Introduction for Beginners
The pandas package is the most important tool at the disposal of
Data Scientists and Analysts working in Python today. The powerful
machine learning and glamorous visualization tools may get all the
attention, but pandas is the backbone of most data projects.

[pandas] is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia (https://en.wikipedia.org/wiki/Pandas_%28software%29)

If you're thinking about data science as a career, then it is imperative that one of the first things you do is learn pandas. In this post, we will go over the essential bits of information about pandas, including how to install it, its uses, and how it works with other common Python data analysis packages such as Matplotlib and scikit-learn.

What's Pandas for?


Pandas has so many uses that it might make sense to list the things
it can't do instead of what it can do.

This tool is essentially your data’s home. Through pandas, you get
acquainted with your data by cleaning, transforming, and analyzing
it.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:

Calculate statistics and answer questions about the data, like:

  - What's the average, median, max, or min of each column?
  - Does column A correlate with column B?
  - What does the distribution of data in column C look like?

Clean the data by doing things like removing missing values and filtering rows or columns by some criteria

Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.

Store the cleaned, transformed data back into a CSV, other file, or database

Before you jump into the modeling or the complex visualizations, you need to have a good understanding of the nature of your dataset, and pandas is the best avenue through which to do that.

How does pandas fit into the data science toolkit?

Not only is the pandas library a central component of the data science toolkit, but it is used in conjunction with other libraries in that collection.

Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.

Jupyter Notebooks offer a good environment for using pandas to do data exploration and modeling, but pandas can also be used in text editors just as easily.

Jupyter Notebooks give us the ability to execute code in a particular cell as opposed to running the entire file. This saves a lot of time when working with large datasets and complex transformations. Notebooks also provide an easy way to visualize pandas' DataFrames and plots. As a matter of fact, this article was created entirely in a Jupyter Notebook.

When should you start using pandas?

If you do not have any experience coding in Python, then you should stay away from learning pandas until you do. You don't have to be at the level of a software engineer, but you should be adept at the basics, such as lists, tuples, dictionaries, functions, and iterations. Also, I'd recommend familiarizing yourself with NumPy due to the similarities mentioned above.

If you're looking for a good place to learn Python, Python for Everybody (https://www.learndatasci.com/out/coursera-programming-everybody-getting-started-python/) on Coursera is great (and free).

Moreover, for those of you looking to do a data science bootcamp (https://www.learndatasci.com/articles/thinkful-data-science-online-bootcamp-review/) or some other accelerated data science education program, it's highly recommended you start learning pandas on your own before you start the program.

Even though accelerated programs teach you pandas, better skills beforehand means you'll be able to maximize time for learning and mastering the more complicated material.

Pandas First Steps

Install and import

Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

conda install pandas

OR

pip install pandas

Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell:

In [ ]: !pip install pandas

The ! at the beginning runs cells as if they were in a terminal.

To import pandas we usually import it with a shorter name since it's used so much:

In [1]: import pandas as pd

Now to the basic components of pandas.

Core components of pandas: Series and DataFrames

The primary two components of pandas are the Series and the DataFrame.

A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.

You'll see how these components work when we start working with
data below.

Creating DataFrames from scratch

Creating DataFrames right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

Let's say we have a fruit stand that sells apples and oranges. We
want to have a column for each fruit and a row for each customer
purchase. To organize this as a dictionary for pandas we could do
something like:

In [38]: data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}

And then pass it to the pandas DataFrame constructor:

In [40]: purchases = pd.DataFrame(data)

purchases

Out[40]:
apples oranges

0 3 0

1 2 3

2 0 7

3 1 2

https://github.com/LearnDataSci/articles/blob/master/Python Pandas Tutorial A Complete Introduction for Beginners/notebook.ipynb 5/41


28/12/2020 articles/notebook.ipynb at master · LearnDataSci/articles · GitHub

How did that work?

Each (key, value) item in data corresponds to a column in the resulting DataFrame.

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

In [42]: purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

Out[42]: apples oranges

June 3 0

Robert 2 3

Lily 0 7

David 1 2

So now we could locate a customer's order by using their name:

In [46]: purchases.loc['June']

Out[46]: apples 3
oranges 0
Name: June, dtype: int64

There's more on locating and extracting data from the DataFrame later, but now you should be able to create a DataFrame with any random data to learn on.
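For instance, here's a quick throwaway DataFrame of random numbers to practice on — a minimal sketch, assuming NumPy is installed (the column names are arbitrary):

In [ ]: import numpy as np

        # 4 rows x 3 columns of random floats in [0, 1)
        practice_df = pd.DataFrame(np.random.rand(4, 3), columns=['a', 'b', 'c'])

        practice_df.head()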

Let's move on to some quick methods for creating DataFrames from various other sources.

How to read in data

It's quite simple to load data from various file formats into a DataFrame. In the following examples we'll keep using our apples and oranges data, but this time it's coming from various files.

Reading data from CSVs

With CSV files all you need is a single line to load in the data:

In [48]: df = pd.read_csv('purchases.csv')

         df

Out[48]: Unnamed: 0 apples oranges

0 June 3 0

1 Robert 2 3

2 Lily 0 7

3 David 1 2

CSVs don't have indexes like our DataFrames, so all we need to do is just designate the index_col when reading:

In [53]: df = pd.read_csv('purchases.csv', index_col=0)

df

Out[53]:
apples oranges

June 3 0

Robert 2 3

Lily 0 7

David 1 2

Here we're setting the index to be column zero.

You'll find that most CSVs won't ever have an index column and so
usually you don't have to worry about this step.

Reading data from JSON

If you have a JSON file — which is essentially a stored Python dict — pandas can read this just as easily:

In [55]: df = pd.read_json('purchases.json')

df

Out[55]: apples oranges

David 1 2

June 3 0

Lily 0 7

Robert 2 3

Notice this time our index came with us correctly since using JSON allowed indexes to work through nesting. Feel free to open data_file.json in a notepad so you can see how it works.

Pandas will try to figure out how to create a DataFrame by analyzing the structure of your JSON, and sometimes it doesn't get it right. Often you'll need to set the orient keyword argument depending on the structure, so check out the read_json docs (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html) about that argument to see which orientation you're using.
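As a quick illustration, if your JSON were keyed by row labels you could say so explicitly — a hedged sketch, assuming a file shaped like {"June": {"apples": 3, "oranges": 0}, ...}:

In [ ]: # 'index' tells pandas the top-level JSON keys are row labels
        df = pd.read_json('purchases.json', orient='index')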

Reading data from a SQL database

If you're working with data from a SQL database you need to first establish a connection using an appropriate Python library, then pass a query to pandas. Here we'll use SQLite to demonstrate.

First, we need pysqlite3 installed, so run this command in your terminal:

pip install pysqlite3

Or run this cell if you're in a notebook:

In [ ]: !pip install pysqlite3

sqlite3 is used to create a connection to a database which we can then use to generate a DataFrame through a SELECT query.

So first we'll make a connection to a SQLite database file:

In [56]: import sqlite3

con = sqlite3.connect("database.db")

Note: If you have data in PostgreSQL, MySQL, or some other SQL server, you'll need to obtain the right Python library to make a connection. For example, psycopg2 (https://initd.org/psycopg/download/) is a commonly used library for making connections to PostgreSQL. Furthermore, you would make a connection to a database URI instead of a file like we did here with SQLite. For a great course on SQL, check out The Complete SQL Bootcamp (https://learndatasci.com/out/udemy-the-complete-sql-bootcamp/) on Udemy.
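Connecting to a PostgreSQL server might look like this — a hedged sketch, assuming SQLAlchemy and psycopg2 are installed, with made-up credentials:

In [ ]: from sqlalchemy import create_engine

        # Hypothetical URI: user, password, host, port, and database name are placeholders
        engine = create_engine('postgresql://user:password@localhost:5432/mydb')

        df = pd.read_sql_query("SELECT * FROM purchases", engine)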

In this SQLite database we have a table called purchases, and our index is in a column called "index".

By passing a SELECT query and our con, we can read from the
purchases table:

In [62]: df = pd.read_sql_query("SELECT * FROM purchases", con)

         df

Out[62]:
index apples oranges

0 June 3 0

1 Robert 2 3

2 Lily 0 7

3 David 1 2

Just like with CSVs, we could pass index_col='index', but we can also set an index after-the-fact:

In [64]: df = df.set_index('index')

df

Out[64]:
apples oranges

index

June 3 0

Robert 2 3

Lily 0 7

David 1 2

In fact, we could use set_index() on any DataFrame using any column at any time. Indexing Series and DataFrames is a very common task, and the different ways of doing it are worth remembering.

Converting back to a CSV, JSON, or SQL

So after extensive work on cleaning your data, you're now ready to save it as a file of your choice. Similar to the ways we read in data, pandas provides intuitive commands to save it:

In [ ]: df.to_csv('new_purchases.csv')

df.to_json('new_purchases.json')

df.to_sql('new_purchases', con)

When we save JSON and CSV files, all we have to input into those
functions is our desired filename with the appropriate file extension.
With SQL, we’re not creating a new file but instead inserting a new
table into the database using our con variable from before.
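One caveat worth knowing: to_sql will raise an error if a table with that name already exists. The if_exists parameter controls what happens instead — a quick sketch:

In [ ]: # 'replace' drops and recreates the table; 'append' adds rows to the existing one
        df.to_sql('new_purchases', con, if_exists='replace')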

Let's move on to importing some real-world data and detailing a few of the operations you'll be using a lot.

Most important DataFrame operations

DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.

Let's load in the IMDB movies dataset to begin:

In [2]: movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

We're loading this dataset from a CSV and designating the movie
titles to be our index.

Viewing your data

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():

In [5]: movies_df.head()

Out[5]:
                         Rank  Genre                     Description
Title
Guardians of the Galaxy     1  Action,Adventure,Sci-Fi   A group of intergalactic crim...
Prometheus                  2  Adventure,Mystery,Sci-Fi  Following clues to the origin...
Split                       3  Horror,Thriller           Three girls are kidnapped by ...
Sing                        4  Animation,Comedy,Family   In a city of humanoid animals...
Suicide Squad               5  Action,Adventure,Fantasy  A secret government agency re...

.head() outputs the first five rows of your DataFrame by default, but we could also pass a number as well: movies_df.head(10) would output the top ten rows, for example.

To see the last five rows use .tail(). tail() also accepts a number, and in this case we're printing the bottom two rows:

In [6]: movies_df.tail(2)

Out[6]:
              Rank  Genre                  Description
Title
Search Party   999  Adventure,Comedy       A pair of friends embark on a mission to reuni...
Nine Lives    1000  Comedy,Family,Fantasy  A stuffy businessman finds himself trapped ins...

Typically when we load in a dataset, we like to view the first five or so rows to see what's under the hood. Here we can see the names of each column, the index, and examples of values in each row.

You'll notice that the index in our DataFrame is the Title column,
which you can tell by how the word Title is slightly lower than the
rest of the columns.

Getting info about your data

.info() should be one of the very first commands you run after loading your data:

In [3]: movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank                  1000 non-null int64
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB

.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Notice in our movies dataset we have some obvious missing values in the Revenue and Metascore columns. We'll look at how to handle those in a bit.

Seeing the datatype quickly is actually quite useful. Imagine you just imported some JSON and the integers were recorded as strings. You go to do some arithmetic and find an "unsupported operand" Exception because you can't do math with strings. Calling .info() will quickly point out that the column you thought was all integers is actually full of string objects.

Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

In [4]: movies_df.shape

Out[4]: (1000, 11)

Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame.

You'll be going to .shape a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed, as in the sketch below.
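A minimal sketch, using a conditional selection (covered in detail later) and a hypothetical cutoff of 8.0 on the Rating column:

In [ ]: before = movies_df.shape[0]

        highly_rated = movies_df[movies_df['Rating'] >= 8.0]  # keep only high ratings

        before - highly_rated.shape[0]  # how many rows the filter removed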

Handling duplicates

This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate rows.

To demonstrate, let's simply just double up our movies DataFrame by appending it to itself:

In [81]: temp_df = movies_df.append(movies_df)

         temp_df.shape

Out[81]: (2000, 11)

Using append() will return a copy without affecting the original DataFrame. We are capturing this copy in temp_df so we aren't working with the real data.

Notice that calling .shape quickly proves our DataFrame rows have doubled.

Now we can try dropping duplicates:

In [82]: temp_df = temp_df.drop_duplicates()

temp_df.shape

Out[82]: (1000, 11)

Just like append(), the drop_duplicates() method will also return a copy of your DataFrame, but this time with duplicates removed. Calling .shape confirms we're back to the 1000 rows of our original dataset.

It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this reason, pandas has the inplace keyword argument on many of its methods. Using inplace=True will modify the DataFrame object in place:

In [83]: temp_df.drop_duplicates(inplace=True)

Now our temp_df will have the transformed data automatically.

Another important argument for drop_duplicates() is keep, which has three possible options:

  - first: (default) Drop duplicates except for the first occurrence.
  - last: Drop duplicates except for the last occurrence.
  - False: Drop all duplicates.

Since we didn't define the keep argument in the previous example it defaulted to first. This means that if two rows are the same, pandas will drop the second row and keep the first row. Using last has the opposite effect: the first row is dropped.

keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will be dropped. Watch what happens to temp_df:

In [85]: temp_df = movies_df.append(movies_df)  # make a new copy

         temp_df.drop_duplicates(inplace=True, keep=False)

         temp_df.shape

Out[85]: (0, 11)

Since all rows were duplicates, keep=False dropped them all, resulting in zero rows being left over. If you're wondering why you would want to do this, one reason is that it allows you to locate all duplicates in your dataset (a sketch of one way follows). When conditional selections are shown below you'll see how to do that.
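A minimal sketch using the related .duplicated() method, which flags each row that has a twin; keep=False marks every copy rather than all-but-one:

In [ ]: temp_df = movies_df.append(movies_df)

        all_dupes = temp_df[temp_df.duplicated(keep=False)]  # every row with a duplicate

        all_dupes.shape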

Column cleanup
Many times datasets will have verbose column names with
symbols, upper and lowercase words, spaces, and typos. To make
selecting data by column name easier we can spend a little time
cleaning up their names.

Here's how to print the column names of our dataset:

In [86]: movies_df.columns

Out[86]: Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
                'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
                'Metascore'],
               dtype='object')

Not only does .columns come in handy if you want to rename columns by allowing for simple copy and paste, it's also useful if you need to understand why you are receiving a KeyError when selecting data by column.

We can use the .rename() method to rename certain or all columns via a dict. We don't want parentheses, so let's rename those:

In [87]: movies_df.rename(columns={
             'Runtime (Minutes)': 'Runtime',
             'Revenue (Millions)': 'Revenue_millions'
         }, inplace=True)

         movies_df.columns

Out[87]: Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
                'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
               dtype='object')

Excellent. But what if we want to lowercase all names? Instead of using .rename() we could also set a list of names to the columns like so:

In [92]: movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year',
                              'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']

         movies_df.columns

Out[92]: Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
                'rating', 'votes', 'revenue_millions', 'metascore'],
               dtype='object')

But that's too much work. Instead of just renaming each column
manually we can do a list comprehension:

In [93]: movies_df.columns = [col.lower() for col in movies_df]

         movies_df.columns

Out[93]: Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
                'rating', 'votes', 'revenue_millions', 'metascore'],
               dtype='object')

list (and dict) comprehensions come in handy a lot when working with pandas and data in general.

It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be working with a dataset for some time, as in the sketch below.
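A one-liner that chains those cleanups together — a hedged sketch; str.strip() and str.replace() handle stray whitespace and spaces, respectively:

In [ ]: # lowercase, trim whitespace, and convert spaces to underscores in one pass
        movies_df.columns = [col.lower().strip().replace(' ', '_') for col in movies_df.columns]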

How to work with missing values

When exploring data, you'll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan, each of which are handled differently in some situations.

There are two options in dealing with nulls:

1. Get rid of rows or columns with nulls
2. Replace nulls with non-null values, a technique known as imputation

Let's calculate the total number of nulls in each column of our dataset. The first step is to check which cells in our DataFrame are null:

In [99]: movies_df.isnull()

Out[99]:
                          rank  genre  description  director  actors  ...
Title
Guardians of the Galaxy  False  False        False     False   False  ...
Prometheus               False  False        False     False   False  ...
Split                    False  False        False     False   False  ...
Sing                     False  False        False     False   False  ...
Suicide Squad            False  False        False     False   False  ...

Notice isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.

To count the number of nulls in each column we use an aggregate function for summing:

In [100]: movies_df.isnull().sum()

Out[100]: rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 128
metascore 64
dtype: int64

.isnull() just by itself isn't very useful; it's usually used in conjunction with other methods, like sum().

We can see now that our data has 128 missing values for
revenue_millions and 64 missing values for metascore.
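As a quick aside, chaining a second sum() collapses that column-wise count into a single grand total of nulls across the whole DataFrame:

In [ ]: movies_df.isnull().sum().sum()  # 192 in this dataset: 128 + 64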

Removing null values

Data Scientists and Analysts regularly face the dilemma of dropping or imputing null values, and it is a decision that requires intimate knowledge of your data and its context. Overall, removing null data is only suggested if you have a small amount of missing data.

Removing nulls is pretty simple:

In [101]: movies_df.dropna()

Out[101]:
                        rank  genre                     ...
Title
Guardians of the Galaxy    1  Action,Adventure,Sci-Fi   ...
Prometheus                 2  Adventure,Mystery,Sci-Fi  ...
Split                      3  Horror,Thriller           ...
Sing                       4  Animation,Comedy,Family   ...
Suicide Squad              5  Action,Adventure,Fantasy  ...
...                      ...  ...                       ...
Hostel: Part II          997  Horror                    ...
Step Up 2: The Streets   998  Drama,Music,Romance       ...
Nine Lives              1000  Comedy,Family,Fantasy     ...

838 rows × 11 columns

This operation will delete any row with at least a single null value,
but it will return a new DataFrame without altering the original one.
You could specify inplace=True in this method as well.

So in the case of our dataset, this operation would remove 128 rows where revenue_millions is null and 64 rows where metascore is null. This obviously seems like a waste since there's perfectly good data in the other columns of those dropped rows. That's why we'll look at imputation next.
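If only one column's nulls matter for your analysis, dropna can target it directly — a hedged sketch using the subset parameter:

In [ ]: # drop only the rows where revenue is null, keeping rows with null metascore
        movies_df.dropna(subset=['revenue_millions'])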

Other than just dropping rows, you can also drop columns with null
values by setting axis=1:

In [102]: movies_df.dropna(axis=1)

Out[102]:
                        rank  genre                     ...
Title
Guardians of the Galaxy    1  Action,Adventure,Sci-Fi   ...
Prometheus                 2  Adventure,Mystery,Sci-Fi  ...
Split                      3  Horror,Thriller           ...
Sing                       4  Animation,Comedy,Family   ...
Suicide Squad              5  Action,Adventure,Fantasy  ...
...                      ...  ...                       ...
Step Up 2: The Streets   998  Drama,Music,Romance       ...
Search Party             999  Adventure,Comedy          ...
Nine Lives              1000  Comedy,Family,Fantasy     ...

1000 rows × 9 columns

In our dataset, this operation would drop the revenue_millions and metascore columns.

Intuition side note: What's with this axis=1 parameter?

It's not immediately obvious where axis comes from and why you
need it to be 1 for it to affect columns. To see why, just look at the
.shape output:

In [103]: movies_df.shape

Out[103]: (1000, 11)

As we learned above, this is a tuple that represents the shape of the DataFrame, i.e. 1000 rows and 11 columns. Note that the rows are at index zero of this tuple and columns are at index one of this tuple. This is why axis=1 affects columns. This comes from NumPy, and is a great example of why learning NumPy is worth your time.

Imputation
Imputation is a conventional feature engineering technique used to
keep valuable data that have null values.

There may be instances where dropping every row with a null value
removes too big a chunk from your dataset, so instead we can
impute that null with another value, usually the mean or the median
of that column.

Let's look at imputing the missing values in the revenue_millions column. First we'll extract that column into its own variable:

In [104]: revenue = movies_df['revenue_millions']

Using square brackets is the general way we select columns in a DataFrame.

If you remember back to when we created DataFrames from scratch, the keys of the dict ended up as column names. Now when we select columns of a DataFrame, we use brackets just like if we were accessing a Python dictionary.

revenue now contains a Series:

In [105]: revenue.head()

Out[105]: Title
          Guardians of the Galaxy    333.13
          Prometheus                 126.46
          Split                      138.12
          Sing                       270.32
          Suicide Squad              325.02
          Name: revenue_millions, dtype: float64

Slightly different formatting than a DataFrame, but we still have our Title index.

We'll impute the missing values of revenue using the mean. Here's
the mean value:

In [107]: revenue_mean = revenue.mean()

revenue_mean

Out[107]: 82.95637614678897

With the mean, let's fill the nulls using fillna():

In [108]: revenue.fillna(revenue_mean, inplace=True)

We have now replaced all nulls in revenue with the mean of the
column. Notice that by using inplace=True we have actually
affected the original movies_df:

In [114]: movies_df.isnull().sum()

Out[114]: rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 0
metascore 64
dtype: int64

Imputing an entire column with the same value like this is a basic
example. It would be a better idea to try a more granular imputation
by Genre or Director.

For example, you would find the mean of the revenue generated in each genre individually and impute the nulls in each genre with that genre's mean, as sketched below.
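A hedged sketch of that genre-wise imputation using groupby and transform (this assumes revenue_millions still contains nulls; ours were just filled above):

In [ ]: # within each genre group, fill that group's nulls with the group's own mean
        movies_df['revenue_millions'] = (
            movies_df.groupby('genre')['revenue_millions']
                     .transform(lambda s: s.fillna(s.mean()))
        )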

Let's now look at more ways to examine and understand the dataset.

Understanding your variables

Using describe() on an entire DataFrame we can get a summary of the distribution of continuous variables:

In [115]: movies_df.describe()

Out[115]:
              rank         year      runtime  rating  ...
count  1000.000000  1000.000000  1000.000000     ...
mean    500.500000  2012.783000   113.172000     ...
std     288.819436     3.205962    18.810908     ...
min       1.000000  2006.000000    66.000000     ...
25%     250.750000  2010.000000   100.000000     ...
50%     500.500000  2014.000000   111.000000     ...
75%     750.250000  2016.000000   123.000000     ...
max    1000.000000  2016.000000   191.000000     ...

Understanding which numbers are continuous also comes in handy when thinking about the type of plot to use to represent your data visually.

.describe() can also be used on a categorical variable to get the count of rows, unique count of categories, top category, and freq of top category:

In [116]: movies_df['genre'].describe()

Out[116]: count                        1000
          unique                        207
          top       Action,Adventure,Sci-Fi
          freq                           50
          Name: genre, dtype: object

This tells us that the genre column has 207 unique values, and that the top value is Action/Adventure/Sci-Fi, which shows up 50 times (freq).

.value_counts() can tell us the frequency of all values in a column:

In [119]: movies_df['genre'].value_counts().head(10)

Out[119]: Action,Adventure,Sci-Fi 50
Drama 48
Comedy,Drama,Romance 35
Comedy 32
Drama,Romance 31
Action,Adventure,Fantasy 27
Comedy,Drama 27
Animation,Adventure,Comedy 27
Comedy,Romance 26
Crime,Drama,Thriller 24
Name: genre, dtype: int64

Relationships between continuous variables

By using the correlation method .corr() we can generate the relationship between each continuous variable:

In [120]: movies_df.corr()

Out[120]:                       rank      year   runtime  ...
          rank              1.000000 -0.261605 -0.221739  ...
          year             -0.261605  1.000000 -0.164900  ...
          runtime          -0.221739 -0.164900  1.000000  ...
          rating           -0.219555 -0.211219  0.392214  ...
          votes            -0.283876 -0.411904  0.407062  ...
          revenue_millions -0.252996 -0.117562  0.247834  ...
          metascore        -0.191869 -0.079305  0.211978  ...

Correlation tables are a numerical representation of the bivariate relationships in the dataset.

Positive numbers indicate a positive correlation — one goes up, the other goes up — and negative numbers represent an inverse correlation — one goes up, the other goes down. 1.0 indicates a perfect correlation.

So looking in the first row, first column we see rank has a perfect correlation with itself, which is obvious. On the other hand, the correlation between votes and revenue_millions is 0.6. A little more interesting — see the sketch below for a quick way to surface such pairs.
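To zero in on one variable's correlations without scanning the whole table, you can pull out a single column of the matrix and sort it — a minimal sketch:

In [ ]: # strongest positive correlations with revenue first
        movies_df.corr()['revenue_millions'].sort_values(ascending=False)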

Examining bivariate relationships comes in handy when you have an outcome or dependent variable in mind and would like to see the features most correlated to the increase or decrease of the outcome. You can visually represent bivariate relationships with scatterplots (seen below in the plotting section).

For a deeper look into data summarizations check out Essential Statistics for Data Science (https://www.learndatasci.com/tutorials/data-science-statistics-using-python/).

Let's now look more at manipulating DataFrames.

DataFrame slicing, selecting, extracting

Up until now we've focused on some basic summaries of our data. We've learned about simple column extraction using single brackets, and we imputed null values in a column using fillna(). Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.

It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to be sure you know which type you are working with or else you will receive attribute errors.

Let's look at working with columns first.

By column

You already saw how to extract a column using square brackets like
this:

In [125]: genre_col = movies_df['genre']

type(genre_col)

Out[125]: pandas.core.series.Series

This will return a Series. To extract a column as a DataFrame, you need to pass a list of column names. In our case that's just a single column:

In [126]: genre_col = movies_df[['genre']]

type(genre_col)

Out[126]: pandas.core.frame.DataFrame

Since it's just a list, adding another column name is easy:

In [127]: subset = movies_df[['genre', 'rating']]

subset.head()

Out[127]:
                         genre                     rating
Title
Guardians of the Galaxy  Action,Adventure,Sci-Fi      8.1
Prometheus               Adventure,Mystery,Sci-Fi     7.0
Split                    Horror,Thriller              7.3
Sing                     Animation,Comedy,Family      7.2
Suicide Squad            Action,Adventure,Fantasy     6.2

Now we'll look at getting data by rows.

By rows

For rows, we have two options:

  - .loc - locates by name
  - .iloc - locates by numerical index

Remember that we are still indexed by movie Title, so to use .loc we give it the Title of a movie:

In [128]: prom = movies_df.loc["Prometheus"]

prom

Out[128]: rank                                                            2
          genre                                    Adventure,Mystery,Sci-Fi
          description    Following clues to the origin of mankind, a te...
          director                                             Ridley Scott
          actors         Noomi Rapace, Logan Marshall-Green, Michael Fa...
          year                                                         2012
          runtime                                                       124
          rating                                                          7
          votes                                                      485820
          revenue_millions                                           126.46
          metascore                                                      65
          Name: Prometheus, dtype: object

On the other hand, with iloc we give it the numerical index of Prometheus:

In [130]: prom = movies_df.iloc[1]

loc and iloc can be thought of as similar to Python list slicing. To show this even further, let's select multiple rows.

How would you do it with a list? In Python, just slice with brackets like example_list[1:4]. It works the same way in pandas:

In [132]: movie_subset = movies_df.loc['Prometheus':'Sing']

          movie_subset = movies_df.iloc[1:4]

          movie_subset

Out[132]:
            rank  genre                     description
Title
Prometheus     2  Adventure,Mystery,Sci-Fi  Following clues to the origin of mankind, a te...
Split          3  Horror,Thriller           Three girls are kidnapped by a man with a diag...
Sing           4  Animation,Comedy,Family   In a city of humanoid animals, a hustling thea...

One subtle difference worth noting: slicing with .loc is inclusive of the end label, so 'Prometheus':'Sing' includes Sing, while .iloc follows Python's slicing convention and excludes the end index, so iloc[1:4] returns rows 1 through 3. Here both happen to select the same three movies.