Module - 3 New
print(df)
A pandas object can be split into groups along either of its axes. There are
several ways to specify the split, for example:
obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key,axis=1)  # the axis argument is deprecated in recent pandas releases
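Let us now see how grouping can be applied to a DataFrame object. Here df is
the DataFrame built from ipl_data in the next example.
import pandas as pd
# Prints the GroupBy object itself, not the grouped rows
print(df.groupby('Team'))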
View Groups
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year':
[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team').groups)
Group by multiple columns:
import pandas as pd
# df is the DataFrame created in the previous example
print(df.groupby(['Team','Year']).groups)
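As a minimal sketch of what grouping on two columns produces, the snippet below uses a small hypothetical table (toy, not the IPL data from these notes). Each key in .groups is a (Team, Year) tuple, and each value holds the row labels that belong to that pair.

```python
import pandas as pd

# Hypothetical toy data, not the IPL table used in these notes
toy = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Year': [2014, 2015, 2014, 2014],
    'Points': [10, 20, 30, 40],
})

# Grouping on two columns gives one group per distinct (Team, Year) pair
groups = toy.groupby(['Team', 'Year']).groups
for key, row_labels in groups.items():
    print(key, list(row_labels))
```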
Iterating through Groups
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)
By default, each group is labelled with the value of the grouping key, so name here is the year.
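The loop can also collect something from each group instead of printing it. The sketch below, on a hypothetical three-row table (toy), counts the rows in each group while iterating.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'Year': [2014, 2015, 2014], 'Points': [876, 789, 863]})

# Each iteration yields the group label and the sub-DataFrame for that label
sizes = {}
for name, group in toy.groupby('Year'):
    sizes[int(name)] = len(group)

print(sizes)
```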
Select a Group
Using the get_group() method, we can select a single group.
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year':
[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped.get_group(2014))
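A short sketch of get_group() on a hypothetical table (toy) follows. It also shows that asking for a label that does not exist raises KeyError, so the lookup may need guarding.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'Year': [2014, 2015, 2014], 'Points': [876, 789, 863]})
grouped = toy.groupby('Year')

# Select only the rows whose Year is 2014
rows_2014 = grouped.get_group(2014)
print(rows_2014)

# A label that is not present in the data raises KeyError
try:
    grouped.get_group(1999)
    missing = False
except KeyError:
    missing = True
print(missing)
```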
Aggregations
An aggregation function returns a single aggregated value for each group. Once
the GroupBy object is created, several aggregation operations can be
performed on the grouped data.
The most direct way is the aggregate method or its equivalent alias agg:
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
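# 'mean' gives the same result as np.mean; recent pandas versions warn when passed the NumPy function
print(grouped['Points'].agg('mean'))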
Aggregating functions are the ones that reduce the dimension of the returned objects.
Some common aggregating functions are tabulated below:
Function      Description
mean()        Compute mean of groups
sum()         Compute sum of group values
size()        Compute group sizes
count()       Compute count of group
std()         Standard deviation of groups
var()         Compute variance of groups
sem()         Standard error of the mean of groups
describe()    Generates descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
nth()         Take nth value, or a subset if n is a list
min()         Compute min of group values
max()         Compute max of group values
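Two entries in the table, size() and count(), are easy to confuse. The sketch below uses a hypothetical table (toy) with one missing value to show the difference: size() counts every row in the group, while count() skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Points value
toy = pd.DataFrame({'Team': ['A', 'A', 'B'],
                    'Points': [10.0, np.nan, 30.0]})
points_by_team = toy.groupby('Team')['Points']

sizes = points_by_team.size()    # rows per group, NaN included
counts = points_by_team.count()  # non-missing values per group

print(sizes)
print(counts)
```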
import pandas as pd
# df is the DataFrame created above; a list applies several functions at once
grouped = df.groupby('Team')
print(grouped['Points'].agg(['sum', 'mean', 'std']))
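When several aggregations are computed together, the result columns can be given explicit names. The sketch below uses pandas named aggregation on a hypothetical table (toy); the output column names total and average are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'Team': ['A', 'A', 'B'], 'Points': [10, 20, 30]})

# Each keyword is an output column: name=(input column, function)
summary = toy.groupby('Team').agg(
    total=('Points', 'sum'),
    average=('Points', 'mean'),
)
print(summary)
```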
Transformation
A transformation is an operation on a group or column that performs some
group-specific computation and returns an object with the same index, and
therefore the same size, as the object being grouped.
# import the pandas library
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print(grouped.transform(score))
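To make the contrast with aggregation concrete, the sketch below runs both on the same hypothetical table (toy). agg returns one value per group, while transform returns one value per original row, which makes it convenient for centering each row on its group mean.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                    'Points': [10, 20, 30, 50]})
points_by_team = toy.groupby('Team')['Points']

means = points_by_team.agg('mean')             # one row per team
row_means = points_by_team.transform('mean')   # one row per original row
centered = toy['Points'] - row_means           # distance from the team mean

print(len(means), len(row_means))
print(centered.tolist())
```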
A lambda function is a small anonymous function.
A lambda function can take any number of arguments, but can only
have one expression.
Syntax
lambda arguments : expression
The expression is executed and the result is returned.
Add 10 to argument a, and return the result:
x = lambda a : a + 10
print(x(5))
Multiply argument a by argument b, and return the result:
x = lambda a, b : a * b
print(x(5, 6))
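Lambdas are most often passed directly to another function rather than stored in a variable. As a small sketch, the hypothetical list pairs below is sorted by its second element using a lambda as the key.

```python
# Hypothetical (team, points) pairs for illustration
pairs = [('Riders', 876), ('Devils', 673), ('Kings', 741)]

# The lambda picks the value to sort by: the points in position 1
by_points = sorted(pairs, key=lambda pair: pair[1])
print(by_points)
```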
Pivot Table
A pivot table lets you calculate, summarize and aggregate your data. MS
Excel has this feature built in and provides an elegant way to create a
pivot table from data. It is a powerful tool that lets you aggregate the
data with calculations such as Sum, Count, Average, Max and Min, configure
the rows and columns of the pivot table, and apply filters and sort orders
to the data once the pivot table has been created.
In Python, pandas can build pivot tables and crosstabs from a DataFrame or
from lists of data.
Let's create a DataFrame of different e-commerce sites and their monthly sales in
different product categories.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'site': ['walmart', 'amazon', 'alibaba', 'flipkart', 'alibaba',
             'flipkart', 'walmart', 'amazon', 'alibaba', 'flipkart'],
    'Product_Category': ['Kitchen', 'Home-Decor', 'Gardening', 'Health',
                         'Beauty', 'Garments', 'Gardening', 'Health',
                         'Beauty', 'Garments'],
    'Product': ['Oven', 'Sofa-set', 'digging spade', 'fitness band',
                'sunscreen', 'pyjamas', 'digging spade', 'fitness band',
                'sunscreen', 'pyjamas'],
    'Sales': [2000, 3000, 4000, 5000, 6000, 9000, 3000, 2500, 1020, 950]})
print(df)
There are 4 sites and 6 different product categories. We will now use this
data to create the pivot table. Before using the pandas pivot table
feature, make sure the DataFrame has been created.
Create Pivot Table
df.pivot_table( index=['Product_Category', 'Product'], values=['Sales'],
columns=['site'])
The index argument lists the columns whose values become the rows of the
pivot table, columns lists the columns whose values become its column
headers, and values names the data to aggregate, here Sales. So this table
shows each Product_Category and Product as a row, with the sales for each
site in its own column.
By default the aggregation function is mean.
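The default can be seen on a small hypothetical table (toy) in which two rows fall into the same cell. With no aggfunc given, the cell holds their mean.

```python
import pandas as pd

# Hypothetical toy data: two amazon rows share the same product
toy = pd.DataFrame({
    'site': ['amazon', 'amazon', 'walmart'],
    'Product': ['Oven', 'Oven', 'Oven'],
    'Sales': [100, 300, 500],
})

# No aggfunc given, so duplicate cells are averaged
table = toy.pivot_table(index='Product', columns='site', values='Sales')
print(table)
```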
Pandas Pivot Table Aggfunc
Let us look at another argument, aggfunc, which accepts one function or a
list of functions. As seen above, if this parameter is not given explicitly
the default function is mean. Other choices include sum, min, max and
count.
df.pivot_table( index=['Product_Category', 'Product'], values=['Sales'],
columns=['site'], aggfunc='min')
List of Aggfunc
Let us pass two functions, min and sum, as a list:
df.pivot_table( index=['Product_Category', 'Product'], values=['Sales'],
columns=['site'], aggfunc=['min', 'sum'])
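Pivot tables built this way usually contain empty cells, because not every product is sold on every site. One possible way to handle that, sketched here on a hypothetical table (toy), is the fill_value argument of pivot_table, which replaces the missing cells with a chosen value.

```python
import pandas as pd

# Hypothetical toy data: each product is sold on one site only
toy = pd.DataFrame({
    'site': ['amazon', 'walmart'],
    'Product': ['Oven', 'Sofa-set'],
    'Sales': [2000, 3000],
})

# fill_value replaces the cells that have no matching rows
table = toy.pivot_table(index='Product', columns='site', values='Sales',
                        aggfunc='sum', fill_value=0)
print(table)
```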
Pandas Crosstabs
In this example we take the daily price of a particular stock symbol over
one quarter. We store these values in a CSV file and load them into a
DataFrame using the pandas library. We then make the date field the index
of the DataFrame by assigning the ValueDate column to the index and
deleting the original ValueDate column.
Sample Data
Below is the sample data for the price of the stock on different days of a given quarter. The data is
saved in a file named stock.csv.
ValueDate,Price
01-01-2018, 1042.05
02-01-2018, 1033.55
03-01-2018, 1029.7
04-01-2018, 1021.3
05-01-2018, 1015.4
...
...
...
...
23-03-2018, 1161.3
26-03-2018, 1167.6
27-03-2018, 1155.25
28-03-2018, 1154
Creating Time Series
import pandas as pd
import matplotlib.pyplot as plt
# skipinitialspace drops the blank that follows each comma in the sample file
data = pd.read_csv('path_to_file/stock.csv', skipinitialspace=True)
df = pd.DataFrame(data, columns = ['ValueDate', 'Price'])
# Parse the day-first dates (DD-MM-YYYY), then set the date as index
df['ValueDate'] = pd.to_datetime(df['ValueDate'], format='%d-%m-%Y')
df.index = df['ValueDate']
del df['ValueDate']
df.plot(figsize=(15, 6))
plt.show()
Output (line plot of Price against the ValueDate index)
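The same steps can be tried without a file on disk. The sketch below reads a few of the sample rows from an in-memory string; the header line and the DD-MM-YYYY reading of the dates are assumptions based on the sample data.

```python
import io
import pandas as pd

# A few rows from the sample data; the header line is assumed
csv_text = """ValueDate,Price
01-01-2018,1042.05
02-01-2018,1033.55
23-03-2018,1161.3
"""

df = pd.read_csv(io.StringIO(csv_text))

# An explicit format avoids reading 02-01-2018 as February 1
df['ValueDate'] = pd.to_datetime(df['ValueDate'], format='%d-%m-%Y')
df = df.set_index('ValueDate')

print(df.index.month.tolist())
```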
Python Basic date and time types
Python's datetime module is used to manipulate dates and times. Date and time objects fall into
two categories, naive and aware.
A naive object does not contain enough information to unambiguously locate itself relative to other
date-time objects. Whether it represents Coordinated Universal Time (UTC), local time or time in
some other time zone is up to the program.
An aware object carries information about the applicable algorithmic and political time
adjustments, such as time zone and daylight saving time. Such objects represent a specific moment in time.
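The difference shows up in the tzinfo attribute. In the sketch below the naive object has no tzinfo, the aware object is attached to UTC, and ordering the two against each other raises TypeError. The variable names are chosen for illustration.

```python
from datetime import datetime, timezone

naive = datetime(2018, 3, 23, 12, 0)                       # no time zone attached
aware = datetime(2018, 3, 23, 12, 0, tzinfo=timezone.utc)  # fixed to UTC

print(naive.tzinfo)       # None
print(aware.utcoffset())  # 0:00:00

# Naive and aware values cannot be ordered against each other
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False
print(comparable)
```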
To use this module, we should import it using −
import datetime
There are different classes, constants and methods in this module.
The constants are:
datetime.MINYEAR
It is the smallest year number allowed in a date or datetime
object. The value is 1.
datetime.MAXYEAR
It is the largest year number allowed in a date or datetime
object. The value is 9999.
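Both limits can be checked directly in the interpreter. The sketch below prints the two constants and shows that a year outside the range is rejected with ValueError.

```python
import datetime

print(datetime.MINYEAR, datetime.MAXYEAR)

# A year below MINYEAR is rejected by the date constructor
try:
    datetime.date(datetime.MINYEAR - 1, 1, 1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```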
The available data types are:
date
It is a date type object. It uses the Gregorian calendar. It has year, month and day attributes.
time
It is a time object class. It is independent of any particular day. It has hour, minute, second,
microsecond and tzinfo attributes.
datetime
It is a combination of a date and a time.
timedelta
It expresses the difference between two date or datetime values, to microsecond resolution.
tzinfo
It is an abstract base class. It holds the time zone information and is used by the datetime and time
classes.
timezone
This class implements tzinfo as a fixed offset from UTC.
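A brief sketch of how these types fit together: a date and a time are combined into a datetime, a timedelta is added to it, and subtracting two datetime values gives a timedelta back. The values are arbitrary.

```python
from datetime import date, datetime, time, timedelta

day = date(2018, 1, 1)
clock = time(9, 30)

# Combine a date and a time into one datetime
start = datetime.combine(day, clock)

# Adding a timedelta shifts the datetime
later = start + timedelta(days=1, hours=2)

# Subtracting two datetimes gives a timedelta
delta = later - start

print(later)
print(delta.total_seconds())
print(timedelta.resolution)  # smallest difference a timedelta can represent
```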
Date Type Object
A date object represents a date, made up of a year, a month and a day. It uses the Gregorian
calendar, extended indefinitely in both directions. In this calendar January 1 of year 1 is day number 1,
January 2 of year 1 is day number 2, and so on.
Some date-related methods are:
Method datetime.date(year, month, day)
This is the constructor that creates a date object. All arguments are required and must be integers.
The year must be in the range MINYEAR to MAXYEAR inclusive. If the given date is not valid, ValueError is raised.
Method date.today()
This method returns the current local date.
Method date.fromtimestamp(timestamp)
This method returns the local date corresponding to a POSIX timestamp. If the timestamp is out of the range
supported by the platform, it may raise OverflowError.
Method date.fromordinal(ordinal)
This method returns the date corresponding to a proleptic Gregorian calendar ordinal, that is, the day
count starting from January 1 of year 1.
Method date.toordinal()
This method returns the proleptic Gregorian calendar ordinal of a date.
Method date.weekday()
This method returns the day of the week as an integer. Monday is 0, Tuesday is 1
and so on.
Method date.isoformat()
This method returns the date as a string in ISO 8601 format, YYYY-MM-DD.
import datetime as dt
new_date = dt.date(1998, 9, 5)  # Store the date 5th September, 1998
print("The Date is: " + str(new_date))
print("Ordinal value of given date: " + str(new_date.toordinal()))
# Monday is 0
print("The weekday of the given date: " + str(new_date.weekday()))
# Create a date from the ordinal value
my_date = dt.date.fromordinal(732698)
print("The Date from ordinal is: " + str(my_date))
# Subtracting two dates creates a timedelta object
td = my_date - new_date
print('td Type: ' + str(type(td)) + '\nDifference: ' + str(td))
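The remaining methods can be exercised in the same way. The sketch below reuses the date from the example above to show isoformat(), its inverse date.fromisoformat() (assuming Python 3.7 or later), and a round trip through the ordinal.

```python
import datetime as dt

new_date = dt.date(1998, 9, 5)

# ISO 8601 text form, and parsing it back into a date
text = new_date.isoformat()
parsed = dt.date.fromisoformat(text)

# toordinal() and fromordinal() are inverses of each other
round_trip = dt.date.fromordinal(new_date.toordinal())

print(text)
print(parsed == new_date, round_trip == new_date)
print(new_date.weekday())  # 5 means Saturday, since Monday is 0
```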