Data Aggregation and Group Operations
Data Aggregation and Group By
• Categorizing a data set and applying a function to each group,
whether an aggregation or a transformation, is often a critical
part of a data analysis workflow.
• After loading, merging, and preparing a data set, a familiar
task is to compute group statistics or possibly pivot tables for
reporting or visualization purposes.
• pandas provides a flexible and high-performance groupby
facility that lets you slice, dice, and summarize data
sets in a natural way.
Learning Objectives
• Split a pandas object into pieces using one or more keys (in
the form of functions, arrays, or DataFrame column names)
• Compute group summary statistics, like count, mean, or
standard deviation, or apply a user-defined function
• Apply a varying set of functions to each column of a
DataFrame
• Apply within-group transformations or other manipulations,
like normalization, linear regression, rank, or subset selection
• Compute pivot tables and cross-tabulations
• Perform quantile analysis and other data-derived group
analyses
GroupBy Mechanics
Pandas DataFrame
The primary pandas data structure.
It is a 2-dimensional labeled data structure with columns of potentially different
types. It is generally the most commonly used pandas object.
To start, say you have the following data about cars and want to
capture it in Python using a pandas DataFrame:
Brand             Price
Honda Civic       22000
Toyota Corolla    25000
Ford Focus        27000
Audi A4           35000
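This is what the Python code looks like for our
example:
import pandas as pd
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'], 'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
print(df)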
• You may have noticed that each row is represented by a number
(also known as the index) starting from 0. Alternatively, you may
assign another value/name to represent each row.
import pandas as pd
cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'], 'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
print(df)
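The note above about giving each row its own name can be sketched as follows. The labels Car_1 to Car_4 are illustrative assumptions, not part of the original data; any list with one label per row should work.

```python
import pandas as pd

cars = {
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
    'Price': [22000, 25000, 27000, 35000],
}

# Hypothetical row labels (an assumption for illustration only).
labels = ['Car_1', 'Car_2', 'Car_3', 'Car_4']

# Passing index= replaces the default 0, 1, 2, ... row numbers.
df = pd.DataFrame(cars, columns=['Brand', 'Price'], index=labels)
print(df)

# Rows can now be looked up by label instead of by position.
ford_price = df.loc['Car_3', 'Price']
print(ford_price)
```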
Method 2: Importing values from an Excel file to
create a pandas DataFrame
import pandas as pd
# for an earlier version of Excel use 'xls'
data = pd.read_excel(r'Path where the Excel file is stored\File name.xlsx')
df = pd.DataFrame(data, columns=['First Column Name', 'Second Column Name', ...])
print(df)
import pandas as pd
cars = pd.read_excel(r'C:\Users\Kavi Guru\Desktop\CARS.xls')
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
print(df)
Creating a pandas DataFrame from a list of lists
# Import the pandas library
import pandas as pd

# Initialize the list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Print the DataFrame
print(df)
Creating a DataFrame from a dict of
ndarrays/lists
• To create a DataFrame from a dict of ndarrays/lists, all the arrays must be the same length.
If an index is passed, its length must equal the length of the arrays. If no
index is passed, the index defaults to range(n), where n is the array length.
# Create a DataFrame from a dict of ndarrays/lists,
# using the default index.

import pandas as pd

# Initialize the data as lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# Create the DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)
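The index rules stated above can be checked with a short sketch. The row labels r1 to r4 are made up for illustration; the point is only that the index length has to match the array length.

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# No index passed: pandas falls back to range(n), here 0..3.
df_default = pd.DataFrame(data)
print(list(df_default.index))

# Index passed: one label per row (labels are hypothetical).
df_labeled = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4'])
print(df_labeled)

# An index whose length differs from the arrays raises ValueError.
mismatch_raised = False
try:
    pd.DataFrame(data, index=['r1', 'r2'])
except ValueError as exc:
    mismatch_raised = True
    print('ValueError:', exc)
```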
Another example creates a pandas DataFrame from a list of dictionaries with both
a row index and a column index.
# Create a pandas DataFrame from a list of
# dictionaries, with both row and
# column indexes.

import pandas as pd

# Initialize the list data.
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# With two column indices, values same
# as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# With two column indices, one of which
# ('b1') is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

# Print the first DataFrame
print(df1, "\n")

# Print the second DataFrame.
print(df2)
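It may help to see what happens to keys that are missing from some dictionaries, or to column names that match no key at all. The sketch below reuses the same data; the variable name df_all is introduced here only for illustration.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# No columns argument: pandas uses the union of all dictionary keys.
# The first dictionary has no 'c', so that cell becomes NaN.
df_all = pd.DataFrame(data, index=['first', 'second'])
print(df_all)

# 'b1' is not a key in any dictionary, so the whole column is NaN.
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df2)
b1_all_missing = df2['b1'].isna().all()
```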
Creating a DataFrame using the zip() function
• Two lists can be merged using list(zip()). Then create the pandas DataFrame
by calling pd.DataFrame().
import pandas as pd
# List 1
Name = ['tom', 'krish', 'nick', 'juli']
# List 2
Age = [25, 30, 26, 22]
# Merge the two lists into a list of tuples using zip().
list_of_tuples = list(zip(Name, Age))
# Display the list of tuples.
print(list_of_tuples)
# Convert the list of tuples into a pandas DataFrame.
df = pd.DataFrame(list_of_tuples, columns=['Name', 'Age'])
# Print the data.
print(df)
Get the maximum value of a DataFrame column
# df is the cars DataFrame, which has a 'Price' column
max1 = df['Price'].max()
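As a self-contained sketch of the same idea, the cars data from earlier can be rebuilt and queried. Looking up the matching row with idxmax is an addition here, not something the slide shows.

```python
import pandas as pd

cars = {
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
    'Price': [22000, 25000, 27000, 35000],
}
df = pd.DataFrame(cars)

# Largest value in the Price column.
max_price = df['Price'].max()

# idxmax returns the row label of that largest value,
# which .loc then uses to fetch the whole row.
top_row = df.loc[df['Price'].idxmax()]
print(max_price, top_row['Brand'])
```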
1. Group the unique values from the Team column
2. Now there’s a bucket for each group
3. Toss the other data into the buckets
4. Apply a function on the weight column of each bucket.
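The four steps above can be sketched in code. Only the column names Team and weight come from the steps; the rows below are made-up sample data, and the mean is just one possible function to apply.

```python
import pandas as pd

# Hypothetical data: only the column names come from the steps above.
df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'C', 'B', 'A'],
    'weight': [70, 82, 64, 90, 78, 76],
})

# Steps 1-3: split the rows into one bucket per unique Team value.
buckets = df.groupby('Team')
for team, rows in buckets:
    print(team, list(rows['weight']))

# Step 4: apply a function (here the mean) to weight within each bucket.
mean_weight = buckets['weight'].mean()
print(mean_weight)
```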
Grouping data with one key:
• To group data with one key, pass that key as the single argument to the groupby function
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                  'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32,
                 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                     'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                           'B.Tech', 'B.com', 'Msc', 'MA']}

# Convert the dictionary into a DataFrame
df = pd.DataFrame(data1)
print(df)
# Group by Name; this returns a DataFrameGroupBy object
df.groupby('Name')
# .groups maps each name to the row labels in that group
print(df.groupby('Name').groups)
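To make the result of .groups more concrete, here is a small sketch on the Name and Age columns of the employee data. The use of .size() and .ngroups is an addition for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

grouped = df.groupby('Name')

# .groups maps each key to the index labels of the rows in that group.
jai_rows = list(grouped.groups['Jai'])
print(jai_rows)

# .size() counts the rows in each group; .ngroups counts the groups.
sizes = grouped.size()
print(sizes)
print(grouped.ngroups)
```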
Grouping data with multiple keys:
• To group data with multiple keys, pass a list of keys
to the groupby function
df.groupby(['Name', 'Qualification'])
print(df.groupby(['Name', 'Qualification']).groups)
• Group keys are sorted by default during the groupby operation. You can
pass sort=False for potential speedups.
# Select the rows that belong to a single group
grp = df.groupby('Name')
grp.get_group('Jai')
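The effect of sort=False and of get_group can be seen in a short sketch. It uses a reduced copy of the employee data; with sort=False the keys should appear in order of first occurrence rather than alphabetically.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

# Default: group keys come back sorted.
sorted_keys = list(df.groupby('Name').size().index)
print(sorted_keys)

# sort=False: group keys keep the order in which they first appear.
unsorted_keys = list(df.groupby('Name', sort=False).size().index)
print(unsorted_keys)

# get_group returns the rows of one group as a DataFrame.
jai = df.groupby('Name').get_group('Jai')
print(jai)
```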
Aggregation
• Aggregation is a process in which we compute a summary
statistic for each group.
• An aggregation function returns a single aggregated value for each
group.
• After splitting the data into groups using the groupby function,
several aggregation operations can be performed on the
grouped data.
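As a minimal sketch of these points, the example below computes a few summary statistics of Age per Name, each giving one value per group. The lambda computing an age range is a hypothetical user-defined function added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

# Split by Name, then select the column to aggregate.
ages = df.groupby('Name')['Age']

# Built-in aggregations: one value per group.
mean_age = ages.mean()
count_age = ages.count()
print(mean_age)
print(count_age)

# A user-defined aggregation (hypothetical): the age range within each group.
age_range = ages.agg(lambda s: s.max() - s.min())
print(age_range)
```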
Code #1: Aggregation using the aggregate method
# importing pandas module
import pandas as pd
# importing numpy as np
import numpy as np
# df is the employee DataFrame created earlier
print(df)
• Now we perform aggregation using the aggregate method
# performing aggregation using
# aggregate method
grp1 = df.groupby('Name')
grp1.aggregate(np.sum)
Now we perform aggregation on a group
containing multiple keys
# performing aggregation on
# a group containing multiple
# keys
grp1 = df.groupby(['Name', 'Qualification'])
grp1.aggregate(np.sum)
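With multiple keys the result is indexed by every key combination. The sketch below restricts the sum to the Age column and then flattens the result with reset_index; both choices are assumptions made to keep the output small.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA'],
})

# Sum Age for every (Name, Qualification) combination.
summed = df.groupby(['Name', 'Qualification'])['Age'].sum()
print(summed)

# The result has a two-level index; a tuple selects one combination.
jai_mca = summed.loc[('Jai', 'MCA')]

# reset_index turns the group keys back into ordinary columns.
flat = summed.reset_index()
print(flat)
```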
Now we apply multiple functions by
passing a list of functions.
# applying functions by passing
# a list of functions
grp = df.groupby('Name')
# importing numpy as np
import numpy as np
print(df)
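The code above stops before the aggregation call itself. One way it might look is sketched below; the choice of the Age column and of sum, mean, and std is an assumption, since the original call is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
    'Age': [27, 24, 22, 32, 33, 36, 27, 32],
})

grp = df.groupby('Name')

# Passing a list of function names yields one output column per function.
# The functions chosen here are illustrative, not taken from the slides.
stats = grp['Age'].agg(['sum', 'mean', 'std'])
print(stats)
```

Groups with a single row, such as Abhi, have no sample standard deviation, so their std entry is NaN.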
Now we perform some group-specific computations
and return a like-indexed