
Module - 3

Data Aggregation and Group Operations


Data Aggregation
What is data aggregation in pandas?
Aggregating Data with Pandas
Data aggregation is the process of gathering data and expressing it in a
summary form. This typically corresponds to summary statistics for
numerical and categorical variables in a data set.
Pandas Groupby: Summarising, Aggregating,
and Grouping data in Python
GroupBy is a simple idea: we split the rows into groups of categories and apply
a function to each group. Although simple, it is an extremely valuable technique
that is widely used in data science. In real data science projects you deal with
large amounts of data and repeat the same calculations many times, so the GroupBy
concept matters for efficiency: it lets you summarize, aggregate, and group data
quickly.
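For instance, grouping a small made-up sales table by city and summing the amounts already shows the split-apply-combine idea (a minimal sketch; the table and column names are illustrative):
import pandas as pd

sales = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
                      'amount': [100, 200, 150, 250]})
# split the rows by city, apply sum to each group, combine the results
print(sales.groupby('city')['amount'].sum())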
Summarize
Summarization means counting and describing all the data present in the data frame.
We can summarize the data in the data frame using the describe() method, which
returns the count, mean, standard deviation, minimum, maximum and quartile values
for each numeric column.
describe(): This method generates descriptive statistics for the numeric columns of the data frame.
Syntax:
dataframe_name.describe()
unique(): This method returns all unique values from the given column.
Syntax:
dataframe['column_name'].unique()
nunique(): This method is similar to unique(), but it returns the count of unique
values.
Syntax:
dataframe_name['column_name'].nunique()
info(): This method prints a concise summary of the data frame: column names, non-null counts and data types.
Syntax:
dataframe.info()
columns: This attribute lists all the column names present in the data frame.
Syntax:
dataframe.columns
Example:
We are going to analyze the student marks data in this example.
# importing pandas as pd for using data frame
import pandas as pd
# creating dataframe with student details
dataframe = pd.DataFrame({'id': [7058, 4511, 7014, 7033],
'name': ['sravan', 'manoj', 'aditya', 'bhanu'],
'Maths_marks': [99, 97, 88, 90],
'Chemistry_marks': [89, 99, 99, 90],
'telugu_marks': [99, 97, 88, 80],
'hindi_marks': [99, 97, 56, 67],
'social_marks': [79, 97, 78, 90], })
print(dataframe)
# describing the data frame
print(dataframe.describe())
print("-----------------------------")
# finding unique values
print(dataframe['Maths_marks'].unique())
print("-----------------------------")
# counting unique values
print(dataframe['Maths_marks'].nunique())
print("-----------------------------")
# display the columns in the data frame
print(dataframe.columns)
print("-----------------------------")
# information about the dataframe (info() prints its summary directly)
dataframe.info()
In the below program we will aggregate data.
# getting the minimum value of every column in the dataframe
print(dataframe.min())
print("-----------------------------------------")
# minimum value of a particular column in the data frame
print(dataframe['Maths_marks'].min())
print("-----------------------------------------")
# computing maximum values
print(dataframe.max())
print("-----------------------------------------")
# computing column sums (numeric_only=True skips the string columns)
print(dataframe.sum(numeric_only=True))
print("-----------------------------------------")
# finding the count of non-null values per column
print(dataframe.count())
print("-----------------------------------------")
# computing standard deviation (numeric columns only)
print(dataframe.std(numeric_only=True))
print("-----------------------------------------")
# computing variance (numeric columns only)
print(dataframe.var(numeric_only=True))
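Several of these statistics can also be computed in a single call with agg(). A small sketch, reusing the same student-marks dataframe and restricting it to its numeric columns:
# aggregating several statistics at once for the numeric mark columns
numeric_cols = ['Maths_marks', 'Chemistry_marks', 'telugu_marks', 'hindi_marks', 'social_marks']
print(dataframe[numeric_cols].agg(['min', 'max', 'sum', 'std']))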
Grouping
A dataframe is grouped on one or more of its columns with the
groupby() method. Groupby refers to a process involving one or
more of the following steps:
Splitting: the data is split into groups by applying some condition
on the dataset.
Applying: a function is applied to each group independently.
Combining: the per-group results are combined into a single data structure.
In many situations we split the data into sets and apply some
functionality to each subset. In the apply step we can perform the
following operations (each is sketched in the short example after this list):

Aggregation: computing a summary statistic

Transformation: performing some group-specific computation

Filtration: discarding data according to some condition
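A short sketch of all three operations on a small made-up DataFrame (the table and column names here are illustrative):
import pandas as pd

demo_df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B'],
                        'points': [10, 20, 30, 40, 50]})
grouped = demo_df.groupby('team')

# Aggregation: one summary value per group
print(grouped['points'].sum())

# Transformation: the result has the same length as the original column
print(grouped['points'].transform(lambda x: x - x.mean()))

# Filtration: keep only the groups with at least 3 rows
print(grouped.filter(lambda g: len(g) >= 3))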


#import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print(df)
A pandas object can be split on any of its axes. There are multiple
ways to split an object, for example:
obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key,axis=1)
Let us now see how a grouping object can be created from the
DataFrame object:
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df.groupby('Team'))
View Groups
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year':
[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team').groups)
Group by with multiple columns:
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df.groupby(['Team', 'Year']).groups)
Iterating through Groups
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
for name, group in grouped:
    print(name)
    print(group)
In each iteration, name is the value of the grouping key (here, the year) and group is the corresponding sub-DataFrame.
Select a Group
Using the get_group() method, we can select a single group.
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year':
[2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped.get_group(2014))
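If the grouping was done on multiple columns, get_group() expects a tuple of key values. A small sketch reusing the same df as above:
grouped_multi = df.groupby(['Team', 'Year'])
print(grouped_multi.get_group(('Riders', 2014)))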
Aggregations
An aggregation function returns a single aggregated value for each group. Once
the groupby object is created, several aggregation operations can be
performed on the grouped data.
An obvious one is aggregation via the aggregate (or the equivalent agg) method:
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))
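In recent pandas versions it is often simpler to pass the aggregation by name as a string, which also avoids a deprecation warning for NumPy callables; the same result can be obtained with, for example:
print(grouped['Points'].agg('mean'))
# or, equivalently
print(grouped['Points'].mean())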
Aggregating functions are the ones that reduce the dimension of the returned objects.
Some common aggregating functions are tabulated below:
Function Description
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group values
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
first() Compute first of group values
last() Compute last of group values
nth() Take nth value, or a subset if n is a list
min() Compute min of group values
max() Compute max of group values
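A few of these can be sketched directly on the ipl_data DataFrame defined above:
grouped = df.groupby('Team')
print(grouped.size())                # number of rows in each team
print(grouped['Points'].first())     # first Points value per team
print(grouped['Points'].describe())  # descriptive statistics of Points per team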
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
print(grouped.agg(np.size))
Applying Multiple Aggregation Functions at Once
With a grouped Series, you can also pass a list or dict of functions to
aggregate with, and generate a DataFrame as output:
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))
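A dict can map each column to its own aggregation, and named aggregation lets you name the output columns explicitly. A short sketch using the same grouped object:
# one aggregation per column, chosen via a dict
print(grouped.agg({'Points': 'mean', 'Rank': 'min'}))
# named aggregation: output column name on the left, (column, function) on the right
print(grouped.agg(avg_points=('Points', 'mean'), best_rank=('Rank', 'min')))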
Transformation
Transformation is an operation on a group or column that performs some
group-specific computation and returns an object that is indexed the same
(and therefore has the same size) as the group it was computed from.
# import the pandas library
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print(grouped.transform(score))
A lambda function is a small anonymous function.
A lambda function can take any number of arguments, but can only
have one expression.
Syntax
lambda arguments : expression
The expression is executed and the result is returned:
Add 10 to argument a, and return the result:
x = lambda a : a + 10
print(x(5))

x = lambda a, b : a * b
print(x(5, 6))
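Lambdas are also handy for one-off groupby aggregations; for example, the range of Points within each team (a small sketch reusing the ipl_data DataFrame from the examples above):
print(df.groupby('Team')['Points'].agg(lambda x: x.max() - x.min()))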
Pivot Table
A pivot table lets you calculate, summarize and aggregate your data. MS
Excel has this feature built in and provides an elegant way to create a
pivot table from data. It is a powerful tool that allows you to aggregate
the data with calculations such as sum, count, average, max and min, to
configure the rows and columns of the pivot table, and to apply filters
and sort orders once the pivot table has been created. In Python, pandas
can build pivot tables and crosstabs from a DataFrame or from lists of data.
Let us create a dataframe of different e-commerce sites and their monthly sales in
different categories:

import pandas as pd
import numpy as np
df = pd.DataFrame({'site' : ['walmart', 'amazon', 'alibaba',
'flipkart','alibaba','flipkart','walmart', 'amazon', 'alibaba', 'flipkart'],
'Product_Category' : ['Kitchen', 'Home-Decor', 'Gardening', 'Health',
'Beauty', 'Garments',
'Gardening', 'Health', 'Beauty', 'Garments'] ,
'Product' : ['Oven','Sofa-set','digging spade','fitness
band','sunscreen','pyjamas','digging spade',
'fitness band','sunscreen','pyjamas'],
'Sales' : [2000,3000,4000,5000,6000,9000,3000,2500,1020,950]})
print(df)
There are 4 sites and 6 different product categories. We will now use this
data to create the pivot table. Before using the pandas pivot table
feature, we have to ensure the dataframe is created.
Create Pivot Table
df.pivot_table( index=['Product_Category', 'Product'], values=['Sales'],
columns=['site'])
The index parameter lists the columns to use as the rows of the pivot table,
values names the data to aggregate, and columns names the column(s) whose
values become the pivot table's columns. Here we want to see each
Product_Category and Product with their sales data for each site as a column.
By default the aggregate function is mean.
Pandas Pivot Table Aggfunc
Let us see another parameter, aggfunc, which takes one function or a list of
functions. As we have seen, if you do not mention this parameter explicitly,
the default function is mean. Now let us try other aggregation functions such
as sum, min, max or count.
df.pivot_table( index=['Product_Category', 'Product'], values=['Sales'],
columns=['site'], aggfunc=min)
List of Aggfunc
Let us pass two aggregation functions as a list, i.e. min and sum:
df.pivot_table( index=['Product_Category', 'Product'], values=['Sales'],
columns=['site'], aggfunc=[min,sum])
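Two other commonly used pivot_table parameters are fill_value, which replaces the NaN shown where a site has no sales for a combination, and margins, which adds row and column totals. A short sketch:
df.pivot_table(index=['Product_Category', 'Product'], values=['Sales'],
               columns=['site'], aggfunc='sum', fill_value=0, margins=True)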
Pandas Crosstabs

A crosstab is a tabular structure showing the relationship between different variables.

There is not much difference between pandas crosstab and pivot_table; they work
almost the same way. The difference is that crosstab works with Series or lists of
values, whereas pivot_table works with a DataFrame, and internally crosstab calls
the pivot table function. So when you have lists of data or Series, use crosstab;
when the data is already in a DataFrame, go for pivot_table.
Let us take the same dataframe as above and apply the same use cases using
crosstab. The default aggfunc here is count, which means it finds the frequency
of each row and column combination:
pd.crosstab([df.Product_Category,df.Product],df.site)
Crosstab Rownames and Column Names
Let us change the row and column names using the rownames and colnames
parameters. Here Product_Category is shown as PC, Product as P and site as S:
pd.crosstab([df.Product_Category, df.Product], df.site, rownames=['PC', 'P'], colnames=['S'])
Crosstab Aggfunc
pd.crosstab([df.Product_Category, df.Product], df.site, values=df.Sales, aggfunc=sum, rownames=['PC', 'P'], colnames=['S'])
List of Aggfunc
Let us take a list of aggregation functions, i.e. sum and min. These functions are stored in a list and passed as aggfunc:
pd.crosstab([df.Product_Category, df.Product], df.site, values=df.Sales, aggfunc=[sum, min], rownames=['PC', 'P'], colnames=['S'])
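crosstab can also report proportions instead of counts through its normalize parameter; for example:
# fraction of rows in each (category, site) cell; normalize='index' or 'columns' also work
pd.crosstab(df.Product_Category, df.site, normalize='all')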
Python - Time Series

A time series is a series of data points in which each data point is
associated with a timestamp. A simple example is the price of a stock in
the stock market at different points of time on a given day. Another
example is the amount of rainfall in a region in different months of the
year.

In the example we take the value of stock prices every day for a quarter
for a particular stock symbol. We capture these values in a csv file and
then organize them into a dataframe using the pandas library. We then set
the date field as the index of the dataframe by making the ValueDate
column the index and deleting the original ValueDate column.
Sample Data
Below is the sample data for the price of the stock on different days of a given quarter. The data is
saved in a file named stock.csv.

ValueDate, Price
01-01-2018, 1042.05
02-01-2018, 1033.55
03-01-2018, 1029.7
04-01-2018, 1021.3
05-01-2018, 1015.4
...
...
...
...
23-03-2018, 1161.3
26-03-2018, 1167.6
27-03-2018, 1155.25
28-03-2018, 1154
Creating Time Series
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns=['ValueDate', 'Price'])
# Set the date as the index (the dates in the file are in day-first, DD-MM-YYYY format)
df['ValueDate'] = pd.to_datetime(df['ValueDate'], format='%d-%m-%Y')
df.index = df['ValueDate']
del df['ValueDate']
df.plot(figsize=(15, 6))
plt.show()
Output: a line plot of the stock Price against the date index over the quarter.
Python Basic date and time types

To manipulate dates and times in Python there is a module called datetime. There are two kinds
of date and time objects: naive and aware.
A naive object does not contain enough information to unambiguously locate itself relative to
other date/time objects; whether it represents Coordinated Universal Time (UTC), local time or
some other timezone is left to the program.
An aware object carries additional information about algorithmic and political time adjustments,
such as timezone and daylight saving data, and can therefore represent a specific moment in
time unambiguously.
To use this module, we should import it using:

import datetime
There are different classes, constants and methods in this module.
The constants are:
datetime.MINYEAR
The smallest year number allowed in a date or datetime
object. The value is 1.
datetime.MAXYEAR
The largest year number allowed in a date or datetime
object. The value is 9999.
The available types are:
date
A date object. It uses the Gregorian calendar and has year, month and day attributes.
time
A time object, independent of any particular day. It has hour, minute, second,
microsecond and tzinfo attributes.
datetime
A combination of a date and a time.
timedelta
A duration expressing the difference between two date, time or datetime values, with microsecond resolution.
tzinfo
An abstract base class that holds time zone information. It is used by the datetime and time
classes.
timezone
A class that implements tzinfo as a fixed offset from UTC.
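A short sketch showing several of these types together:
import datetime as dt

d = dt.date(2018, 1, 1)                                  # a date: year, month and day
t = dt.time(14, 30, 0)                                   # a time of day, independent of any date
dtm = dt.datetime(2018, 1, 1, 14, 30)                    # a combined date and time
delta = dt.timedelta(days=7)                             # a duration
print(d, t)
print(dtm + delta)                                       # datetime arithmetic with a timedelta
print(dt.datetime(2018, 1, 1, tzinfo=dt.timezone.utc))   # an aware datetime with a fixed UTC offset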
Date Type Object
A date object represents a date with day, month and year parts. It uses the Gregorian
calendar, in which January 1 of year 1 is day number 1, January 2 of year 1 is day number 2, and so on.
Some date-related methods are:
Method date(year, month, day)
The constructor for a date object. All arguments are required and must be integers. The year
must be in the range MINYEAR to MAXYEAR. If the given date is not valid, a ValueError is raised.
Method date.today()
Returns the current local date.
Method date.fromtimestamp(timestamp)
Returns the date corresponding to a POSIX timestamp. If the timestamp value is out of range, an
OverflowError is raised.
Method date.fromordinal(ordinal)
Returns the date corresponding to a proleptic Gregorian calendar ordinal, i.e. the count of days
from January 1 of year 1.
Method date.toordinal()
Returns the proleptic Gregorian calendar ordinal of the date.
Method date.weekday()
Returns the day of the week as an integer, where Monday is 0, Tuesday is 1,
and so on.
Method date.isoformat()
Returns the date as a string in ISO 8601 format, YYYY-MM-DD.
import datetime as dt

new_date = dt.date(1998, 9, 5)  # store the date 5th September, 1998
print("The Date is: " + str(new_date))
print("Ordinal value of given date: " + str(new_date.toordinal()))
print("The weekday of the given date: " + str(new_date.weekday()))  # Monday is 0
my_date = dt.date.fromordinal(732698)  # create a date from an ordinal value
print("The Date from ordinal is: " + str(my_date))
td = my_date - new_date  # subtracting two dates gives a timedelta object
print('td Type: ' + str(type(td)) + '\nDifference: ' + str(td))
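The remaining methods listed above, today(), fromtimestamp() and isoformat(), can be sketched as:
import datetime as dt
import time

print(dt.date.today())                     # the current local date
print(dt.date.fromtimestamp(time.time()))  # date from a POSIX timestamp
print(dt.date(1998, 9, 5).isoformat())     # '1998-09-05'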
