Data Aggregation and Group Operations

Uploaded by

Kavitha
Data Aggregation and Group By
• Categorizing a data set and applying a function to each group,
whether an aggregation or a transformation, is a core step in many
data analysis workflows.
• After loading, merging, and preparing a data set, a familiar
task is to compute group statistics or possibly pivot tables for
reporting or visualization purposes.
• pandas provides a flexible and high-performance groupby
facility, enabling you to slice, dice, and summarize data
sets in a natural way.
Learning Objectives
• Split a pandas object into pieces using one or more keys (in
the form of functions, arrays, or DataFrame column names)
• Compute group summary statistics, like count, mean, or
standard deviation, or a user-defined function
• Apply a varying set of functions to each column of a
DataFrame
• Apply within-group transformations or other manipulations,
like normalization, linear regression, rank, or subset selection
• Compute pivot tables and cross-tabulations
• Perform quantile analysis and other data-derived group
analyses
GroupBy Mechanics
Pandas DataFrame
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different
types. It is generally the most commonly used pandas object.

We will see two different methods to create a pandas DataFrame:

• By typing the values in Python itself to create the DataFrame
• By importing the values from a file (such as an Excel file), and then creating the
DataFrame in Python based on the values imported

Method 1: typing values in Python to create a pandas DataFrame

Syntax:
import pandas as pd

data = {'First Column Name': ['First value', 'Second value', ...],
        'Second Column Name': ['First value', 'Second value', ...], ...}

df = pd.DataFrame(data, columns=['First Column Name', 'Second Column Name', ...])

print(df)
DataFrame: the primary pandas data structure.

Parameters:
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects.
    Changed in version 0.23.0: If data is a dict, argument order is
    maintained for Python 3.6 and later.
index : Index or array-like
    Index to use for the resulting frame. Will default to RangeIndex if
    the input data carries no indexing information and no index is provided.
columns : Index or array-like
    Column labels to use for the resulting frame. Will default to
    RangeIndex (0, 1, 2, …, n) if no column labels are provided.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.
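To make the index, columns, and dtype parameters concrete, here is a minimal sketch. The array values and the labels r1, r2, a, b, c are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 array of integers, used only for illustration.
values = np.array([[1, 2, 3], [4, 5, 6]])

# No index or columns given: both default to a RangeIndex (0, 1, 2, ...).
default_df = pd.DataFrame(values)

# Explicit row labels, column labels, and a single forced dtype.
labeled_df = pd.DataFrame(values,
                          index=['r1', 'r2'],
                          columns=['a', 'b', 'c'],
                          dtype='float64')

print(default_df)
print(labeled_df)
print(labeled_df.dtypes)
```

The second frame holds the same numbers as the first, but every column is stored as float64 and rows and columns are addressed by label.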
Now let’s see how to apply the above template using a simple
example.

To start, let’s say that you have the following data about Cars, and that you want to
capture that data in Python using Pandas DataFrame:

Brand             Price
Honda Civic       22000
Toyota Corolla    25000
Ford Focus        27000
Audi A4           35000
This is what the Python code looks like for our
example:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(cars, columns=['Brand', 'Price'])

print(df)
• You may have noticed that each row is represented by a number
(also known as the index) starting from 0. Alternatively, you may
assign another value/name to represent each row.

• For example, in the code below,
index=['Car_1', 'Car_2', 'Car_3', 'Car_4'] was added:

import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(cars, columns=['Brand', 'Price'],
                  index=['Car_1', 'Car_2', 'Car_3', 'Car_4'])

print(df)
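Once rows carry custom labels, they can be looked up by label as well as by position. A short sketch using the same cars data; the variable names audi_price and first_brand are only illustrative.

```python
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(cars, columns=['Brand', 'Price'],
                  index=['Car_1', 'Car_2', 'Car_3', 'Car_4'])

# Label-based lookup: .loc uses the custom index labels.
audi_price = df.loc['Car_4', 'Price']

# Position-based lookup: .iloc ignores the labels and counts from 0.
first_brand = df.iloc[0, 0]

print(audi_price)
print(first_brand)
```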
Method 2: importing values from an Excel file to
create a pandas DataFrame
import pandas as pd

# for an earlier version of Excel, use the 'xls' extension
data = pd.read_excel(r'Path where the Excel file is stored\File name.xlsx')
# read_excel already returns a DataFrame; columns= keeps only the listed columns
df = pd.DataFrame(data, columns=['First Column Name', 'Second Column Name', ...])
print(df)

For example:
import pandas as pd

cars = pd.read_excel(r'C:\Users\Kavi Guru\Desktop\CARS.xls')
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
print(df)
Creating a pandas DataFrame from a list of
lists
# import the pandas library
import pandas as pd

# initialize the list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# print the DataFrame
print(df)
Creating a DataFrame from a dict of
ndarrays/lists
• To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length.
If an index is passed, its length must equal the length of the arrays. If no
index is passed, the index defaults to range(n), where n is the array length.

# Python code demonstrating how to create a
# DataFrame from a dict of ndarrays/lists
# with the default index.

import pandas as pd

# initialise the data as lists
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# create the DataFrame
df = pd.DataFrame(data)

# print the output
print(df)
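The length rules above can be checked directly. The sketch below reuses the same data; try_build is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# No index passed: the index defaults to 0..n-1.
default_index = list(pd.DataFrame(data).index)

# An index of matching length (4 labels for 4 rows) is accepted.
labeled = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])


def try_build(columns, index=None):
    """Hypothetical helper: return the error message if construction fails, else None."""
    try:
        pd.DataFrame(columns, index=index)
        return None
    except ValueError as exc:
        return str(exc)


# An index with 3 labels for 4 rows is rejected.
short_index_error = try_build(data, index=['a', 'b', 'c'])

# Columns of different lengths are rejected.
ragged_error = try_build({'Name': ['Tom', 'nick'], 'Age': [20]})

print(default_index)
print(short_index_error)
print(ragged_error)
```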
Another example: creating a pandas DataFrame from a list of dictionaries with both
a row index and a column index.
# Python code demonstrating how to create a
# pandas DataFrame from a list of
# dictionaries, with both
# row and column indexes.

import pandas as pd

# initialise the list data
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# with two column labels that match
# the dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# with two column labels, one of which ('b1')
# is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

# print the first DataFrame
print(df1, "\n")

# print the second DataFrame
print(df2)
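It helps to see what happens to keys that are missing from a dictionary or absent from the requested columns. A small sketch with the same list of dictionaries; the names all_columns and renamed are only illustrative.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# Without columns=, every key seen becomes a column.
# The first dictionary has no 'c', so that cell is NaN.
all_columns = pd.DataFrame(data, index=['first', 'second'])

# 'b1' is not a key in any dictionary, so the whole column is NaN.
renamed = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

print(all_columns)
print(renamed)
```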
Creating a DataFrame using the zip() function
• Two lists can be merged with list(zip()). Then create the pandas DataFrame
by calling pd.DataFrame().

import pandas as pd

# List1
Name = ['tom', 'krish', 'nick', 'juli']
# List2
Age = [25, 30, 26, 22]

# Merge the two lists into a list of tuples
# by using zip().
list_of_tuples = list(zip(Name, Age))

# Inspect the list of tuples.
print(list_of_tuples)

# Convert the list of tuples into a
# pandas DataFrame.
df = pd.DataFrame(list_of_tuples, columns=['Name', 'Age'])

# Print data.
print(df)
Get the maximum value from a DataFrame column

# df here is the cars DataFrame with a 'Price' column, defined earlier
max1 = df['Price'].max()
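As a possible follow-up, idxmax() returns the row label of the maximum, which lets us recover the whole row rather than only the value. A sketch on the cars data; top_label and top_brand are illustrative names.

```python
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'])

# Largest value in the Price column.
max1 = df['Price'].max()

# Row label where that maximum occurs, then the brand on that row.
top_label = df['Price'].idxmax()
top_brand = df.loc[top_label, 'Brand']

print(max1, top_label, top_brand)
```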
1. Group the unique values from the Team column
2. Now there’s a bucket for each group
3. Toss the other data into the buckets
4. Apply a function on the weight column of each bucket.
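The four steps can be sketched by hand and then compared with a single groupby call. The players table below is hypothetical data invented for this illustration; only the Team and weight column names come from the steps above, and the mean is just one possible function to apply.

```python
import pandas as pd

# Hypothetical data: a Team column to group on and a weight column to summarize.
players = pd.DataFrame({'Team': ['A', 'B', 'A', 'B', 'C'],
                        'weight': [80, 90, 70, 100, 85]})

# Steps 1-3: one bucket per unique Team value, each holding that team's rows.
buckets = {team: rows for team, rows in players.groupby('Team')}

# Step 4: apply a function (here the mean) to the weight column of each bucket.
manual = {team: rows['weight'].mean() for team, rows in buckets.items()}

# The same result in one expression.
mean_weight = players.groupby('Team')['weight'].mean()

print(manual)
print(mean_weight)
```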
Grouping data with one key:
• To group data with one key, we pass that single key as an argument to the groupby function.
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                  'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32,
                 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                     'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                           'B.Tech', 'B.com', 'Msc', 'MA']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print(df)

# groupby returns a GroupBy object; .groups maps each key to its row labels
df.groupby('Name')
print(df.groupby('Name').groups)
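To see what the groups mapping holds, the sketch below converts it to plain lists and also counts the rows per group. It uses the Name and Age columns from the employee data; the names groups and sizes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32]})

grouped = df.groupby('Name')

# .groups maps each group key to the row labels that belong to it.
groups = {name: list(rows) for name, rows in grouped.groups.items()}

# .size() counts the rows in each group; .ngroups counts the groups.
sizes = grouped.size()

print(groups)
print(sizes)
print(grouped.ngroups)
```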
Grouping data with multiple keys:
• To group data with multiple keys, we pass a list of keys
to the groupby function.

df.groupby(['Name', 'Qualification'])

print(df.groupby(['Name', 'Qualification']).groups)

• Group keys are sorted by default in the groupby operation. You can
pass sort=False for potential speedups.

# using groupby with the default sort=True (group keys are sorted)
# numeric_only=True restricts the sum to numeric columns such as Age
df.groupby(['Name']).sum(numeric_only=True)

# using groupby with sort=False (group keys keep their order of first appearance)
df.groupby(['Name'], sort=False).sum(numeric_only=True)
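The effect of sort is easiest to see in the order of the result's index. A brief sketch on the Name and Age columns; sorted_keys and unsorted_keys are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32]})

# Default sort=True: group keys come back in sorted order.
sorted_sum = df.groupby('Name')['Age'].sum()
sorted_keys = list(sorted_sum.index)

# sort=False: group keys keep the order in which they first appear.
unsorted_sum = df.groupby('Name', sort=False)['Age'].sum()
unsorted_keys = list(unsorted_sum.index)

print(sorted_keys)
print(unsorted_keys)
```

The sums themselves are identical in both cases; only the row order of the result changes.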

Iteration in groups
• We can iterate over the groups in much the same way as with
itertools.groupby: each iteration yields the group name and the rows of that group.
grp = df.groupby('Name')
for name, group in grp:
    print(name)
    print(group)
    print()
Now we iterate over groups
formed with multiple keys
# iterating over the groups
# of a GroupBy object built
# from multiple keys

grp = df.groupby(['Name', 'Qualification'])

for name, group in grp:
    print(name)
    print(group)
    print()
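With multiple keys, the name yielded on each iteration is a tuple holding one value per key. The sketch below collects those names from the employee data to make this visible; names and row_counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                                     'B.Tech', 'B.com', 'Msc', 'MA']})

names = []
row_counts = []
for name, group in df.groupby(['Name', 'Qualification']):
    # name is a (Name, Qualification) tuple; group is a DataFrame of matching rows.
    names.append(name)
    row_counts.append(len(group))

print(names)
print(row_counts)
```

In this data every (Name, Qualification) pair occurs once, so each group holds a single row.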
To select a single group, we can use
GroupBy.get_group().

• GroupBy.get_group takes a group key and returns the rows of that single group.

# selecting a single group

grp = df.groupby('Name')
grp.get_group('Jai')
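One way to read get_group is as a shortcut for a boolean filter on the grouping column. The sketch below compares the two on the Name and Age columns; jai and by_mask are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32]})

# Rows of the 'Jai' group, selected through the GroupBy object.
jai = df.groupby('Name').get_group('Jai')

# The same rows, selected with a boolean mask on the Name column.
by_mask = df[df['Name'] == 'Jai']

print(jai)
print(by_mask)
```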
Aggregation
• Aggregation is a process in which we compute a summary
statistic about each group.
• An aggregation function returns a single aggregated value for each
group.
• After splitting data into groups using the groupby function,
several aggregation operations can be performed on the
grouped data.
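As a minimal sketch of "one value per group", the example below applies a built-in aggregation and a user-defined one to the Age column. The function age_range is hypothetical and written only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32]})

ages = df.groupby('Name')['Age']

# Built-in aggregation: one mean per group.
mean_age = ages.mean()


def age_range(values):
    """Hypothetical user-defined aggregation: spread between oldest and youngest."""
    return values.max() - values.min()


# User-defined aggregation: agg calls age_range once per group.
spread = ages.agg(age_range)

print(mean_age)
print(spread)
```

Both results have one row per distinct Name, which is what distinguishes an aggregation from a transformation.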
Code #1: Using aggregation via the aggregate method
# importing pandas module
import pandas as pd

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                  'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32,
                 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                     'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                           'B.Tech', 'B.com', 'Msc', 'MA']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

print(df)
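• Now we perform aggregation using the aggregate method
# performing aggregation using
# the aggregate method; the string 'sum' names the built-in group sum
# ('sum' is applied to every column; select grp1['Age'] to sum only the ages)

grp1 = df.groupby('Name')

grp1.aggregate('sum')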
Now we perform aggregation on a group
formed with multiple keys
# performing aggregation on
# a group formed with multiple
# keys
grp1 = df.groupby(['Name', 'Qualification'])

grp1.aggregate('sum')
Now we apply multiple functions by
passing a list of functions.
# applying several functions at once
# by passing a list of function names

grp = df.groupby('Name')

grp['Age'].agg(['sum', 'mean', 'std'])


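Now we apply a different aggregation to
each column of a DataFrame.
# 'Score' is not in the DataFrame defined above, so add it first
# (same values as in the Transformation example)
df['Score'] = [23, 34, 35, 45, 47, 50, 52, 53]

# using a different aggregation
# function per column by passing a dictionary
# to agg
grp = df.groupby('Name')

grp.agg({'Age': 'sum', 'Score': 'std'})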
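A related option is named aggregation, where each output column gets its own name and is defined as a (column, function) pair. This is a sketch of that alternative, not the method used in the slides; the output names total_age and mean_score are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32],
                   'Score': [23, 34, 35, 45, 47, 50, 52, 53]})

# Each keyword names an output column; the tuple is (input column, function).
summary = df.groupby('Name').agg(total_age=('Age', 'sum'),
                                 mean_score=('Score', 'mean'))

print(summary)
```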


Transformation:

• Transformation is a process in which we perform some group-specific computations and
return a like-indexed object, that is, a result with the same index as the input.
# importing pandas module
import pandas as pd

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32, 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj', 'Jaunpur', 'Kanpur',
                     'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA'],
         'Score': [23, 34, 35, 45, 47, 50, 52, 53]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

print(df)
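Now we perform some group-specific computations
and return a like-indexed object

# using the transform function: standardize each value within its group
# (subtract the group mean, divide by the group standard deviation, scale by 10).
# Only the numeric columns are selected, since this arithmetic is undefined for strings.
grp = df.groupby('Name')
sc = lambda x: (x - x.mean()) / x.std() * 10
grp[['Age', 'Score']].transform(sc)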
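To check the "like-indexed" property, the sketch below transforms the Score column and compares the result's index with the original. It also shows that a group with a single row has no defined standard deviation, so its standardized value is NaN. The names centered and zscores are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Score': [23, 34, 35, 45, 47, 50, 52, 53]})

scores = df.groupby('Name')['Score']

# transform('mean') repeats each group's mean on every row of that group,
# so it lines up with the original column for row-wise arithmetic.
centered = df['Score'] - scores.transform('mean')

# Standardize within each group; single-row groups (Gaurav, Abhi) give NaN.
zscores = scores.transform(lambda x: (x - x.mean()) / x.std())

print(centered)
print(zscores)
```

Unlike an aggregation, the result has one row per input row, not one row per group.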
Filtration
• Filtration is a process in which we discard some groups according
to a group-wise computation that evaluates to True or False. To
filter groups, we use the filter method and supply the condition
that each group must satisfy to be kept.
# filtering data using
# the filter method: keep only names that appear at least twice
grp = df.groupby('Name')
grp.filter(lambda x: len(x) >= 2)
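The sketch below shows which rows survive the filter and reproduces the same selection with a boolean mask built from group sizes. The mask approach is an alternative shown for comparison, and the names kept and same_rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32]})

# Keep only the rows whose Name group has at least two members.
kept = df.groupby('Name').filter(lambda g: len(g) >= 2)

# Alternative: broadcast each group's size to its rows, then mask.
group_size = df.groupby('Name')['Age'].transform('size')
same_rows = df[group_size >= 2]

print(kept)
print(same_rows)
```

Gaurav and Abhi appear once each, so their rows are dropped; the remaining rows keep their original order and index labels.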
