Pandas Guide
This tool is essentially your data’s home. Through pandas, you get
acquainted with your data by cleaning, transforming, and analyzing
it.
Before you jump into modeling or complex visualizations, you need a
good understanding of the nature of your dataset, and pandas is the
best avenue through which to get it.
You'll see how these components work when we start working with
data below.
Let's say we have a fruit stand that sells apples and oranges. We
want to have a column for each fruit and a row for each customer
purchase. To organize this as a dictionary for pandas we could do
something like:
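In [38]: data = {
             'apples': [3, 2, 0, 1],
             'oranges': [0, 3, 7, 2]
         }
         # Pass the dictionary to the DataFrame constructor
         # (assumes pandas was imported as pd).
         purchases = pd.DataFrame(data)
         purchases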
Out[40]:
   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2
purchases
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2
In [46]: purchases.loc['June']
Out[46]: apples     3
         oranges    0
         Name: June, dtype: int64
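The two purchases tables and the .loc lookup above can be reproduced with a short sketch like the one below. It assumes the customer names are supplied through the index argument of the DataFrame constructor; the variable june_order is only an illustrative name.

```python
import pandas as pd

# Purchase counts per fruit, one list entry per customer.
data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2],
}

# Assumption: the customer names are passed as the row index.
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

# .loc looks up a row by its index label and returns it as a Series.
june_order = purchases.loc['June']
print(june_order)
```

Without the index argument, pandas falls back to the 0 to 3 integer index shown in the first table.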
In [48]: df = pd.read_csv('purchases.csv')
df
https://github.com/LearnDataSci/articles/blob/master/Python Pandas Tutorial A Complete Introduction for Beginners/notebook.ipynb 6/41
28/12/2020 articles/notebook.ipynb at master · LearnDataSci/articles · GitHub
           apples  oranges
0    June       3        0
1  Robert       2        3
2    Lily       0        7
3   David       1        2
df
Out[53]:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2
Most CSVs don't have an index column, so you usually don't have to
worry about designating one when you read the file.
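As a rough illustration of the difference between the two CSV outputs above, the sketch below reads the same text twice, once with the defaults and once with index_col=0. The CSV text is a stand-in for purchases.csv, and its empty first header cell is an assumption about how that file was written.

```python
import io
import pandas as pd

# Stand-in for purchases.csv (assumed layout: names in the first column).
csv_text = """,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2
"""

# Default: the name column is read as ordinary data and pandas
# adds a fresh 0..n-1 integer index.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default)
print(df_indexed)
```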
In [55]: df = pd.read_json('purchases.json')
         df
        apples  oranges
David        1        2
June         3        0
Lily         0        7
Robert       2        3
Notice that this time our index came with us correctly, since JSON
lets the index labels be stored through nesting. Feel free to open
data_file.json in a notepad so you can see how it works.
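To make the nesting concrete, here is a minimal sketch with hypothetical file contents: each column name maps to an inner object whose keys are the index labels. The JSON text is an assumption for illustration, not the actual file.

```python
import io
import pandas as pd

# Hypothetical JSON contents: column -> {index label -> value}.
json_text = """
{
  "apples":  {"June": 3, "Robert": 2, "Lily": 0, "David": 1},
  "oranges": {"June": 0, "Robert": 3, "Lily": 7, "David": 2}
}
"""

# The inner keys become the row index, so no extra step is needed.
df = pd.read_json(io.StringIO(json_text))
print(df)
```

The row order you get back can depend on the pandas version, so it is safer to look rows up by label than by position.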
import sqlite3
con = sqlite3.connect("database.db")
By passing a SELECT query and our con, we can read from the
purchases table:
df
Out[62]:
    index  apples  oranges
0    June       3        0
1  Robert       2        3
2    Lily       0        7
3   David       1        2
In [64]: df = df.set_index('index')
         df
Out[64]:
        apples  oranges
index
June         3        0
Robert       2        3
Lily         0        7
David        1        2
In [ ]: df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
df.to_sql('new_purchases', con)
When we save JSON and CSV files, all we have to input into those
functions is our desired filename with the appropriate file extension.
With SQL, we’re not creating a new file but instead inserting a new
table into the database using our con variable from before.
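The SQL round trip can be tried without touching any files. The sketch below is one possible version: it uses an in-memory SQLite database in place of database.db, and read_sql_query is one of several pandas readers that would work here.

```python
import sqlite3
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# An in-memory database stands in for database.db (assumption for the demo).
con = sqlite3.connect(':memory:')

# to_sql creates the table; an unnamed index is stored in a column called "index".
purchases.to_sql('new_purchases', con)

# Read the table back, then restore the index from that column.
df = pd.read_sql_query('SELECT * FROM new_purchases', con)
df = df.set_index('index')
con.close()

print(df)
```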
We're loading this dataset from a CSV and designating the movie
titles to be our index.
In [5]: movies_df.head()
Out[5]:
                         Rank  Genre                     Desc...
Title
Guardians of the Galaxy     1  Action,Adventure,Sci-Fi   A gro...
Prometheus                  2  Adventure,Mystery,Sci-Fi  Follo...
Split                       3  Horror,Thriller           Thre...
Sing                        4  Animation,Comedy,Family   In a...
Suicide Squad               5  Action,Adventure,Fantasy  A se...
To see the last five rows, use .tail(). It also accepts a number; in
this case we print the bottom two rows:
In [6]: movies_df.tail(2)
Out[6]:
              Rank  Genre                  Description
Title
Search Party   999  Adventure,Comedy       A pair of friends embark on a mission to reuni...
Nine Lives    1000  Comedy,Family,Fantasy  A stuffy businessman finds himself trapped ins...
You'll notice that the index in our DataFrame is the Title column,
which you can tell by how the word Title is slightly lower than the
rest of the columns.
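A small sketch of the same calls on a toy frame, assuming a six-row stand-in for the movie data (ranks and genres copied from the outputs in this guide). It shows the default of five rows, an explicit count, and where the index name lives.

```python
import pandas as pd

# Toy stand-in for movies_df: six rows, Title used as the index.
movies = pd.DataFrame({
    'Title': ['Guardians of the Galaxy', 'Prometheus', 'Split',
              'Sing', 'Suicide Squad', 'The Great Wall'],
    'Rank': [1, 2, 3, 4, 5, 6],
    'Genre': ['Action,Adventure,Sci-Fi', 'Adventure,Mystery,Sci-Fi',
              'Horror,Thriller', 'Animation,Comedy,Family',
              'Action,Adventure,Fantasy', 'Action,Adventure,Fantasy'],
}).set_index('Title')

first_five = movies.head()   # no argument: first five rows
last_two = movies.tail(2)    # explicit count: last two rows

# The word printed on its own line under the header is the index name.
print(movies.index.name)
print(last_two)
```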
In [3]: movies_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank                  1000 non-null int64
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
Seeing the datatype quickly is actually quite useful. Imagine you just
imported some JSON and the integers were recorded as strings.
You go to do some arithmetic and hit an "unsupported operand"
exception because you can't do math with strings. Calling .info()
will quickly point out that the column you thought was all integers
actually contains string objects.
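The scenario above can be sketched as follows. The votes column and its values are a hypothetical import; pd.to_numeric is one way to repair it, not necessarily the only one.

```python
import pandas as pd

# Hypothetical import where the integers arrived as strings.
df = pd.DataFrame({'votes': ['757074', '485820', '157606']})

# .info() lists the column with a text dtype rather than int64.
df.info()
was_numeric = pd.api.types.is_numeric_dtype(df['votes'])

# Convert the strings to numbers so arithmetic works as expected.
df['votes'] = pd.to_numeric(df['votes'])
total_votes = df['votes'].sum()
print(was_numeric, total_votes)
```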
In [4]: movies_df.shape
Handling duplicates
temp_df.shape
temp_df.shape
In [83]: temp_df.drop_duplicates(inplace=True)
Setting keep=False, on the other hand, drops all duplicates: if two
rows are the same, then both are dropped. Watch what happens to temp_df:
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape
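For readers who want to watch the shapes change, here is a minimal sketch on a three-row toy frame. Building the duplicated frame with pd.concat is an assumption made for the demo; the names deduped and none_left are illustrative.

```python
import pandas as pd

movies = pd.DataFrame(
    {'rank': [1, 2, 3]},
    index=pd.Index(['Guardians of the Galaxy', 'Prometheus', 'Split'],
                   name='Title'),
)

# Stack the frame on top of itself so every row appears twice.
temp_df = pd.concat([movies, movies])
print(temp_df.shape)    # (6, 1)

# Default keep='first': one copy of each duplicated row survives.
deduped = temp_df.drop_duplicates()
print(deduped.shape)    # (3, 1)

# keep=False: every row that has a duplicate is dropped, so nothing is left.
none_left = temp_df.drop_duplicates(keep=False)
print(none_left.shape)  # (0, 1)
```

Note that drop_duplicates compares column values only, so the repeated titles in the index play no part in the comparison.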
Column cleanup
Many times datasets will have verbose column names with
symbols, upper and lowercase words, spaces, and typos. To make
selecting data by column name easier we can spend a little time
cleaning up their names.
In [86]: movies_df.columns
In [87]: movies_df.rename(columns={
             'Runtime (Minutes)': 'Runtime',
             'Revenue (Millions)': 'Revenue_millions'
         }, inplace=True)
movies_df.columns
movies_df.columns
But that's too much work. Instead of just renaming each column
manually we can do a list comprehension:
movies_df.columns
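One way the rename followed by a list comprehension might look is sketched below, using an empty toy frame that carries only four of the column names. Lowercasing every name is the assumed goal of the comprehension.

```python
import pandas as pd

# Empty toy frame with a few of the original column names.
movies = pd.DataFrame(
    columns=['Rank', 'Genre', 'Runtime (Minutes)', 'Revenue (Millions)']
)

# Rename the awkward names explicitly.
movies = movies.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue_millions',
})

# Lowercase every column name in one pass instead of listing each one.
movies.columns = [col.lower() for col in movies.columns]
print(list(movies.columns))
```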
In [99]: movies_df.isnull()
Out[99]:
                          rank  genre  description  director  a...
Title
Guardians of the Galaxy  False  False        False     False  F...
Suicide Squad            False  False        False     False  F...
In [100]: movies_df.isnull().sum()
Out[100]: rank                  0
          genre                 0
          description           0
          director              0
          actors                0
          year                  0
          runtime               0
          rating                0
          votes                 0
          revenue_millions    128
          metascore            64
          dtype: int64
We can see now that our data has 128 missing values for
revenue_millions and 64 missing values for metascore.
In [101]: movies_df.dropna()
Out[101]:
                                         rank  genre
Title
Guardians of the Galaxy                     1  Action,Adventure,Sci-Fi
Prometheus                                  2  Adventure,Mystery,Sci-Fi
Split                                       3  Horror,Thriller
Sing                                        4  Animation,Comedy,Family
Suicide Squad                               5  Action,Adventure,Fantasy
The Great Wall                              6  Action,Adventure,Fantasy
La La Land                                  7  Comedy,Drama,Music
Passengers                                 10  Adventure,Drama,Romance
Fantastic Beasts and Where to Find Them    11  Adventure,Family,Fantasy
Hidden Figures                             12  Biography,Drama,History
Moana                                      14  Animation,Adventure,Comedy
Colossal                                   15  Action,Comedy,Drama
The Secret Life of Pets                    16  Animation,Adventure,Comedy
Hacksaw Ridge                              17  Biography,Drama,History
Lion                                       19  Biography,Drama
Arrival                                    20  Drama,Mystery,Sci-Fi
Gold                                       21  Adventure,Drama,Thriller
Manchester by the Sea                      22  Drama
Trolls                                     24  Animation,Adventure,Comedy
Independence Day: Resurgence               25  Action,Adventure,Sci-Fi
Assassin's Creed                           30  Action,Adventure,Drama
Nocturnal Animals                          32  Drama,Thriller
X-Men: Apocalypse                          33  Action,Adventure,Sci-Fi
Deadpool                                   34  Action,Adventure,Comedy
Resident Evil: The Final Chapter           35  Action,Horror,Sci-Fi
...
That Awkward Moment                       956  Comedy,Romance
Lucky Number Slevin                       960  Crime,Drama,Mystery
Into the Forest                           962  Drama,Sci-Fi,Thriller
The Other Boleyn Girl                     963  Biography,Drama,History
I Spit on Your Grave                      964  Crime,Horror,Thriller
Texas Chainsaw 3D                         971  Horror,Thriller
Queen of Katwe                            975  Biography,Drama,Sport
My Big Fat Greek Wedding 2                976  Comedy,Family,Romance
The Skin I Live In                        980  Drama,Thriller
Miracles from Heaven                      981  Biography,Drama,Family
Across the Universe                       983  Drama,Fantasy,Musical
Your Highness                             986  Adventure,Comedy,Fantasy
Final Destination 5                       987  Horror,Thriller
Underworld: Rise of the Lycans            991  Action,Adventure,Fantasy
Taare Zameen Par                          992  Drama,Family,Music
Resident Evil: Afterlife                  994  Action,Adventure,Horror
Step Up 2: The Streets                    998  Drama,Music,Romance
This operation will delete any row with at least a single null value,
but it will return a new DataFrame without altering the original one.
You could specify inplace=True in this method as well.
So in the case of our dataset, this operation would remove 128 rows
where revenue_millions is null and 64 rows where metascore is
null. This obviously seems like a waste since there's perfectly good
data in the other columns of those dropped rows. That's why we'll
look at imputation next.
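The behaviour described here can be checked on a toy frame. The sketch below uses four made-up rows (titles A to D and their values are placeholders, not real data) to show that dropna() returns a new frame and leaves the original alone.

```python
import numpy as np
import pandas as pd

# Toy data: row B lacks revenue, row C lacks a metascore.
movies = pd.DataFrame({
    'rank': [1, 2, 3, 4],
    'revenue_millions': [333.13, np.nan, 138.12, 270.32],
    'metascore': [76.0, 65.0, np.nan, 59.0],
}, index=pd.Index(['A', 'B', 'C', 'D'], name='Title'))

# Count the nulls per column.
null_counts = movies.isnull().sum()

# Any row with at least one null is dropped; the result is a new DataFrame.
complete_rows = movies.dropna()

print(null_counts)
print(complete_rows.shape, movies.shape)
```

Rows B and C are lost even though most of their values are present, which is the waste that imputation tries to avoid.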
Other than just dropping rows, you can also drop columns with null
values by setting axis=1:
In [102]: movies_df.dropna(axis=1)
Out[102]:
                                         rank  genre
Title
Guardians of the Galaxy                     1  Action,Adventure,Sci-Fi
Prometheus                                  2  Adventure,Mystery,Sci-Fi
Split                                       3  Horror,Thriller
Sing                                        4  Animation,Comedy,Family
Suicide Squad                               5  Action,Adventure,Fantasy
The Great Wall                              6  Action,Adventure,Fantasy
La La Land                                  7  Comedy,Drama,Music
Mindhorn                                    8  Comedy
Passengers                                 10  Adventure,Drama,Romance
Fantastic Beasts and Where to Find Them    11  Adventure,Family,Fantasy
Hidden Figures                             12  Biography,Drama,History
Moana                                      14  Animation,Adventure,Comedy
Colossal                                   15  Action,Comedy,Drama
The Secret Life of Pets                    16  Animation,Adventure,Comedy
Hacksaw Ridge                              17  Biography,Drama,History
Lion                                       19  Biography,Drama
Arrival                                    20  Drama,Mystery,Sci-Fi
Gold                                       21  Adventure,Drama,Thriller
Manchester by the Sea                      22  Drama
Hounds of Love                             23  Crime,Drama,Horror
Trolls                                     24  Animation,Adventure,Comedy
Independence Day: Resurgence               25  Action,Adventure,Sci-Fi
Paris pieds nus                            26  Comedy
Bahubali: The Beginning                    27  Action,Adventure,Drama
Assassin's Creed                           30  Action,Adventure,Drama
...
Texas Chainsaw 3D                         971  Horror,Thriller
Queen of Katwe                            975  Biography,Drama,Sport
My Big Fat Greek Wedding 2                976  Comedy,Family,Romance
Amateur Night                             978  Comedy
The Skin I Live In                        980  Drama,Thriller
Miracles from Heaven                      981  Biography,Drama,Family
Across the Universe                       983  Drama,Fantasy,Musical
Your Highness                             986  Adventure,Comedy,Fantasy
Final Destination 5                       987  Horror,Thriller
Underworld: Rise of the Lycans            991  Action,Adventure,Fantasy
Taare Zameen Par                          992  Drama,Family,Music
Take Me Home Tonight                      993  Comedy,Drama,Romance
Resident Evil: Afterlife                  994  Action,Adventure,Horror
Secret in Their Eyes                      996  Crime,Drama,Mystery
Step Up 2: The Streets                    998  Drama,Music,Romance
It's not immediately obvious where axis comes from and why you
need it to be 1 for it to affect columns. To see why, just look at the
.shape output:
In [103]: movies_df.shape
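The link between .shape and axis can be sketched on a toy frame (four made-up rows, three columns). .shape is a (rows, columns) tuple, so position 0 refers to rows and position 1 to columns, and axis follows the same numbering.

```python
import numpy as np
import pandas as pd

movies = pd.DataFrame({
    'rank': [1, 2, 3, 4],
    'revenue_millions': [333.13, np.nan, 138.12, 270.32],
    'metascore': [76.0, 65.0, np.nan, 59.0],
})

# shape[0] is the number of rows, shape[1] the number of columns.
n_rows, n_cols = movies.shape

# axis=0 (the default) drops rows that contain a null.
without_null_rows = movies.dropna(axis=0)

# axis=1 drops columns that contain a null.
without_null_cols = movies.dropna(axis=1)

print(n_rows, n_cols)
print(without_null_rows.shape, without_null_cols.shape)
```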
Imputation
Imputation is a conventional feature engineering technique used to
keep valuable data that have null values.
There may be instances where dropping every row with a null value
removes too big a chunk from your dataset, so instead we can
impute that null with another value, usually the mean or the median
of that column.
In [105]: revenue.head()
Out[105]: Title
          Guardians of the Galaxy    333.13
          Prometheus                 126.46
          Split                      138.12
          Sing                       270.32
          Suicide Squad              325.02
          Name: revenue_millions, dtype: float64
We'll impute the missing values of revenue using the mean. Here's
the mean value:
revenue_mean = revenue.mean()
revenue_mean
Out[107]: 82.95637614678897
We have now replaced all nulls in revenue with the mean of the
column. Notice that by using inplace=True we have actually
affected the original movies_df:
In [114]: movies_df.isnull().sum()
Out[114]: rank                 0
          genre                0
          description          0
          director             0
          actors               0
          year                 0
          runtime              0
          rating               0
          votes                0
          revenue_millions     0
          metascore           64
          dtype: int64
Imputing an entire column with the same value like this is a basic
example. It would be a better idea to try a more granular imputation
by Genre or Director.
For example, you would find the mean of the revenue generated in
each genre individually and impute the nulls in each genre with that
genre's mean.
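Both flavours of imputation can be compared side by side. The sketch below uses made-up revenue figures and single-word genres; the column names revenue_overall and revenue_by_genre are illustrative, and groupby with transform('mean') is one possible way to get a per-genre mean, not necessarily the approach intended here.

```python
import numpy as np
import pandas as pd

# Toy data: one missing revenue value in each genre.
movies = pd.DataFrame({
    'genre': ['Action', 'Action', 'Action', 'Drama', 'Drama', 'Drama'],
    'revenue_millions': [300.0, 100.0, np.nan, 20.0, 40.0, np.nan],
})

# Column-wide imputation: every null gets the overall mean (115.0 here).
overall_mean = movies['revenue_millions'].mean()
movies['revenue_overall'] = movies['revenue_millions'].fillna(overall_mean)

# Per-genre imputation: each null gets the mean of its own genre.
genre_mean = movies.groupby('genre')['revenue_millions'].transform('mean')
movies['revenue_by_genre'] = movies['revenue_millions'].fillna(genre_mean)

print(movies)
```

Assigning the filled column back to the DataFrame avoids relying on inplace=True through a separate variable, which may not carry over to the original frame in newer pandas versions that use copy-on-write.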
In [115]: movies_df.describe()
Out[115]:
rank year runtime ra
In [116]: movies_df['genre'].describe()
This tells us that the genre column has 207 unique values and that the
top value is Action,Adventure,Sci-Fi, which shows up 50 times (freq).
In [119]: movies_df['genre'].value_counts().head(10)
Out[119]: Action,Adventure,Sci-Fi       50
          Drama                         48
          Comedy,Drama,Romance          35
          Comedy                        32
          Drama,Romance                 31
          Action,Adventure,Fantasy      27
          Comedy,Drama                  27
          Animation,Adventure,Comedy    27
          Comedy,Romance                26
          Crime,Drama,Thriller          24
          Name: genre, dtype: int64
In [120]: movies_df.corr()
So looking in the first row, first column we see rank has a perfect
correlation with itself, which is obvious. On the other hand, the
correlation between votes and revenue_millions is 0.6. A little
more interesting.
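The summary calls above can be tried on a toy frame with made-up values. One assumption worth flagging: recent pandas versions raise an error if .corr() meets a text column, so this sketch selects the numeric columns first.

```python
import pandas as pd

movies = pd.DataFrame({
    'genre': ['Drama', 'Comedy', 'Drama', 'Action', 'Drama'],
    'votes': [100, 200, 300, 400, 500],
    'revenue_millions': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# On a text column, describe() reports count, unique, top and freq.
genre_summary = movies['genre'].describe()

# value_counts() gives the frequency of every distinct value.
genre_counts = movies['genre'].value_counts()

# Correlate only the numeric columns; the diagonal is always 1.0.
corr = movies.select_dtypes('number').corr()

print(genre_summary)
print(corr)
```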
It's important to note that, although many methods are the same,
DataFrames and Series have different attributes, so you'll need to
know which type you are working with or else you will
receive attribute errors.
By column
You already saw how to extract a column using square brackets like
this:
genre_col = movies_df['genre']
type(genre_col)
Out[125]: pandas.core.series.Series
genre_col = movies_df[['genre']]
type(genre_col)
Out[126]: pandas.core.frame.DataFrame
subset.head()
Title
Guardians of the Galaxy  Action,Adventure,Sci-Fi  8.1
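The difference between the two return types comes down to what goes inside the brackets. A minimal sketch on a two-row toy frame, with genre and rating values taken from the outputs in this guide and illustrative variable names:

```python
import pandas as pd

movies = pd.DataFrame({
    'genre': ['Action,Adventure,Sci-Fi', 'Adventure,Mystery,Sci-Fi'],
    'rating': [8.1, 7.0],
}, index=pd.Index(['Guardians of the Galaxy', 'Prometheus'], name='Title'))

# A single column name returns a Series.
genre_series = movies['genre']

# A list of column names returns a DataFrame, even with one name in it.
genre_frame = movies[['genre']]

# More names in the list give a wider subset.
subset = movies[['genre', 'rating']]

print(type(genre_series), type(genre_frame))
```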
By rows
prom
Out[128]: rank                2
          genre               Adventure,Mystery,Sci-Fi
          description         Following clues to the origin of mankind, a te...
          director            Ridley Scott
          actors              Noomi Rapace, Logan Marshall-Green, Michael Fa...
          year                2012
          runtime             124
          rating              7
          votes               485820
          revenue_millions    126.46
          metascore           65
          Name: Prometheus, dtype: object
How would you do it with a list? In Python, just slice with brackets
like example_list[1:4]. It works the same way in pandas:
movie_subset = movies_df.iloc[1:4]
movie_subset
Title
Follow
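Row selection by label and by position can be compared on a five-row toy frame. The sketch assumes .loc for labels and .iloc for integer positions; the variable names other than movie_subset are illustrative.

```python
import pandas as pd

movies = pd.DataFrame(
    {'rank': [1, 2, 3, 4, 5]},
    index=pd.Index(['Guardians of the Galaxy', 'Prometheus', 'Split',
                    'Sing', 'Suicide Squad'], name='Title'),
)

# Single row by label, and the same row by integer position.
prom = movies.loc['Prometheus']
also_prom = movies.iloc[1]

# iloc slices like a list: positions 1, 2, 3 (the stop position is excluded).
movie_subset = movies.iloc[1:4]

# loc slices by label and includes the stop label.
by_label = movies.loc['Prometheus':'Sing']

print(movie_subset)
```

Both slices return the same three rows here, but only the label-based one includes its end point.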