Important Data Formatting Methods
(merge, sort, reset_index, fillna)
Let’s start with
our zoo dataset
again.
Pandas Merge
(a.k.a. “joining” dataframes)
Let’s say that we have another dataframe, zoo_eats,
that contains information about the food requirements
for each species.
Make a zoo_eats.csv file with this data:
animal,food
elephant,vegetables
tiger,meat
kangaroo,vegetables
zebra,meat
giraffe,vegetables
Loading the data:
zoo_eats = pd.read_csv('zoo_eats.csv')
zoo_eats
Let’s merge these two pandas dataframes:
zoo.merge(zoo_eats)
Try:
zoo_eats.merge(zoo)
zoo.merge(zoo_eats)
Is it the same?
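One way to check is to compare the two results directly. The sketch below is only an illustration: the zoo dataframe here is a small hypothetical stand-in (its uniq_id and water_need values are made up), and zoo_eats is rebuilt from the CSV data shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for the zoo dataframe; the values are illustrative only.
zoo = pd.DataFrame({
    'animal': ['elephant', 'tiger', 'kangaroo', 'zebra', 'lion'],
    'uniq_id': [1001, 1004, 1016, 1012, 1020],
    'water_need': [500, 300, 410, 220, 600],
})

# Same content as zoo_eats.csv.
zoo_eats = pd.DataFrame({
    'animal': ['elephant', 'tiger', 'kangaroo', 'zebra', 'giraffe'],
    'food': ['vegetables', 'meat', 'vegetables', 'meat', 'vegetables'],
})

left_first = zoo.merge(zoo_eats)    # zoo's columns come first
right_first = zoo_eats.merge(zoo)   # zoo_eats' columns come first

print(list(left_first.columns))
print(list(right_first.columns))
print(len(left_first), len(right_first))
```

With this stand-in data both calls return the same rows; only the column order differs, because the columns of the dataframe you call merge on are listed first.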
Where are
all the
lions?
Calm down… the lions will come back.
Pandas Merge…
But how?
Inner, outer, left or right?
HOW DO YOU WANT
TO MERGE?
Let’s try this:
zoo.merge(zoo_eats, how = 'outer')
See?
The lions came back…
and the giraffe showed up, too…
Let’s try this too:
zoo.merge(zoo_eats, how = 'left')
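To see the four options side by side, here is a minimal sketch. It assumes two tiny hypothetical dataframes (not the full zoo data) in which lion appears only on the left and giraffe only on the right.

```python
import pandas as pd

# Tiny hypothetical dataframes: 'lion' exists only in zoo, 'giraffe' only in zoo_eats.
zoo = pd.DataFrame({
    'animal': ['elephant', 'tiger', 'lion'],
    'water_need': [500, 300, 600],
})
zoo_eats = pd.DataFrame({
    'animal': ['elephant', 'tiger', 'giraffe'],
    'food': ['vegetables', 'meat', 'vegetables'],
})

# Collect the animals that survive each kind of merge.
animals_by_how = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = zoo.merge(zoo_eats, how=how)
    animals_by_how[how] = sorted(merged['animal'])
    print(how, animals_by_how[how])
```

inner keeps only the animals found in both dataframes, outer keeps every animal from either one, left keeps everything from zoo, and right keeps everything from zoo_eats. Values that are missing on the other side show up as NaN.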
Sorting in Pandas
Sorts In Ascending Order
zoo.sort_values(by = ['water_need'])
Sorts In Descending Order
zoo.sort_values('water_need', ascending = False)
Sorts By Multiple Columns
zoo.sort_values(by = ['animal', 'water_need'])
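When sorting by several columns, ascending can also take one value per column. The sketch below uses a small hypothetical zoo dataframe (made-up values) and sorts animals alphabetically, with the highest water_need first within each animal.

```python
import pandas as pd

# Hypothetical sample with repeated animals; the values are illustrative only.
zoo = pd.DataFrame({
    'animal': ['tiger', 'elephant', 'tiger', 'elephant'],
    'water_need': [300, 500, 320, 550],
})

# One ascending flag per column: animal A-Z, then water_need high-to-low.
by_animal_then_water = zoo.sort_values(
    by=['animal', 'water_need'],
    ascending=[True, False],
)
print(by_animal_then_water)

# sort_values returns a new dataframe; zoo itself keeps its original order.
print(zoo)
```

Note that the sorted result keeps the original index labels, so they are no longer in order.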
Reset Index
Wrong indexing can mess up your visualizations or even
your machine learning models.
zoo.sort_values(by = ['water_need'], ascending = False).reset_index()
As you can see, our new dataframe kept the old
indexes, too. If you want to remove them, just add
drop = True to reset_index().
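The difference between the two calls can be sketched on a small hypothetical dataframe (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; the values are illustrative only.
zoo = pd.DataFrame({
    'animal': ['zebra', 'elephant', 'tiger'],
    'water_need': [220, 500, 300],
})

sorted_zoo = zoo.sort_values(by=['water_need'], ascending=False)

# Without drop: the old index is kept as a new column called 'index'.
kept = sorted_zoo.reset_index()

# With drop=True: the old index is discarded.
dropped = sorted_zoo.reset_index(drop=True)

print(kept)
print(dropped)
```

In both cases the new index runs from 0 upward; drop = True only decides whether the old index values survive as an extra column.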
Fillna
Note: fillna is basically fill + na in one word.
Let’s rerun the left-merge method that we have used
above:
zoo.merge(zoo_eats, how = 'left')
The problem is that we
have NaN values for lions.
Let’s replace them with something more
meaningful:
zoo.merge(zoo_eats, how = 'left').fillna('unknown')
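Keep in mind that fillna('unknown') fills every NaN in the dataframe, whichever column it is in. If only the food column should be filled, one option is to pass a dictionary of column names and fill values. This is a minimal sketch on hypothetical two-row data, not the full zoo dataset.

```python
import pandas as pd

# Hypothetical sample: 'lion' has no matching row in zoo_eats.
zoo = pd.DataFrame({
    'animal': ['elephant', 'lion'],
    'water_need': [500, 600],
})
zoo_eats = pd.DataFrame({
    'animal': ['elephant'],
    'food': ['vegetables'],
})

merged = zoo.merge(zoo_eats, how='left')

# Fill NaN only in the 'food' column; other columns are left untouched.
filled = merged.fillna({'food': 'unknown'})

print(merged)
print(filled)
```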
Let’s get back to our article_read dataset.
This dataset holds the data of a travel blog.
Download another dataset from:
46.101.230.157/dilan/pandas_tutorial_buy.csv
and name it blog_buy.
There are 4 variables in blog_buy. Name those
variables, in order, 'my_date_time',
'event', 'user_id' and 'amount'.
Test Yourself #1!
Merge article_read and blog_buy.
Test Yourself #2!
What’s the average (mean) revenue
between 2018-01-01 and 2018-01-07 from
the users in the article_read dataframe?
Test Yourself #3!
Print the top 3 countries by total revenue
between 2018-01-01 and 2018-01-07! (Obviously,
this concerns the users in
the article_read dataframe again.)