Data Aggregation and Group Operations

Uploaded by

Kavitha

Data Aggregation and Group By
• Categorizing a data set and applying a function to each group, whether an aggregation or a transformation, is a common part of a data analysis workflow.
• After loading, merging, and preparing a data set, a familiar task is to compute group statistics or pivot tables for reporting or visualization purposes.
• pandas provides a flexible, high-performance groupby facility that lets you slice, dice, and summarize data sets in a natural way.
Learning Objectives
• Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
• Compute group summary statistics, like count, mean, or standard deviation, or a user-defined function
• Apply a varying set of functions to each column of a DataFrame
• Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
• Compute pivot tables and cross-tabulations
• Perform quantile analysis and other data-derived group analyses
GroupBy Mechanics
Pandas DataFrame
It is a 2-dimensional labeled data structure with columns of potentially different
types. It is generally the most commonly used pandas object.

We will see two different methods to create Pandas DataFrame:

• By typing the values in Python itself to create the DataFrame
• By importing the values from a file (such as an Excel file), and then creating the
DataFrame in Python based on the values imported

Method 1: typing values in Python to create Pandas DataFrame

Syntax:
import pandas as pd

data = {'First Column Name': ['First value', 'Second value', ...],
        'Second Column Name': ['First value', 'Second value', ...], ....}

df = pd.DataFrame(data, columns=['First Column Name', 'Second Column Name', ...])

print(df)
The primary pandas data structure.

Parameters:
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects.
    Changed in version 0.23.0: If data is a dict, argument order is maintained for Python 3.6 and later.
index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.
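The parameters listed above can be tried out together in a small sketch. The array values and the labels 'r1', 'r2', 'a', 'b', 'c' are made up for illustration and are not part of the original slides.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 array used only for illustration
values = np.array([[1, 2, 3], [4, 5, 6]])

# No index or columns given: both default to a RangeIndex
default_df = pd.DataFrame(values)

# Explicit row labels, column labels, and a single forced dtype
labeled_df = pd.DataFrame(values, index=['r1', 'r2'],
                          columns=['a', 'b', 'c'], dtype='float64')

print(default_df)
print(labeled_df.dtypes)
```

Here default_df gets integer row and column labels, while labeled_df uses the labels passed in and stores every column as float64.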
Now let's see how to apply the above template using a simple example.

To start, let's say that you have the following data about cars, and that you want to capture that data in Python using a Pandas DataFrame:

Brand            Price
Honda Civic      22000
Toyota Corolla   25000
Ford Focus       27000
Audi A4          35000

This is what the Python code looks like for our example:
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(cars, columns=['Brand', 'Price'])

print(df)
• You may have noticed that each row is represented by a number (also known as the index) starting from 0. Alternatively, you may assign another value/name to represent each row.

• For example, in the code below, index=['Car_1', 'Car_2', 'Car_3', 'Car_4'] was added:

import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(cars, columns=['Brand', 'Price'],
                  index=['Car_1', 'Car_2', 'Car_3', 'Car_4'])

print(df)
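One practical effect of custom row labels is that rows can be looked up by name. The sketch below reuses the cars data from the slides; the choice of .loc for the lookup is an illustration and is not taken from the original deck.

```python
import pandas as pd

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}
df = pd.DataFrame(cars, columns=['Brand', 'Price'],
                  index=['Car_1', 'Car_2', 'Car_3', 'Car_4'])

# Look up a row by its label rather than by its position
row = df.loc['Car_3']
print(row['Brand'], row['Price'])
```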
Method 2: importing values from an Excel file to create Pandas DataFrame

Syntax:
import pandas as pd

# For an earlier version of Excel, use the 'xls' extension
data = pd.read_excel(r'Path where the Excel file is stored\File name.xlsx')
df = pd.DataFrame(data, columns=['First Column Name', 'Second Column Name', ...])
print(df)

Example:
import pandas as pd

cars = pd.read_excel(r'C:\Users\Kavi Guru\Desktop\CARS.xls')
df = pd.DataFrame(cars, columns=['Brand', 'Price'])
print(df)
Creating Pandas DataFrame from lists of lists.
# Import pandas library
import pandas as pd

# Initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Print the DataFrame
print(df)
Creating DataFrame from dict of ndarray/lists
• To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.
# Python code demonstrating how to create a
# DataFrame from a dict of ndarrays/lists
# with the default index.

import pandas as pd

# Initialise data as a dict of lists
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output
print(df)
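The equal-length rule can be checked directly. The sketch below uses made-up data in which one list is deliberately too short. The names bad_data, good_df, and error_message are hypothetical.

```python
import pandas as pd

# Hypothetical data in which 'Age' has one value fewer than 'Name'
bad_data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19]}

try:
    pd.DataFrame(bad_data)
    error_message = None
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length
    error_message = str(exc)

print(error_message)

# With equal lengths and no index passed, the index defaults to 0..n-1
good_df = pd.DataFrame({'Name': ['Tom', 'nick'], 'Age': [20, 21]})
print(list(good_df.index))
```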
Another example: creating a pandas DataFrame from a list of dictionaries with both a row index and a column index.
# Python code demonstrating how to create a
# pandas DataFrame from a list of
# dictionaries, with both
# row and column indexes.

import pandas as pd

# Initialise the list data
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# With two column indices, values same
# as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# With two column indices, one of which
# ('b1') is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

# Print the first DataFrame
print(df1, "\n")

# Print the second DataFrame
print(df2)
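What happens to keys that are missing from a dictionary, or to requested columns that match no key, can be seen in a short sketch. It reuses the data list from the slide; df_all is a hypothetical extra frame added for comparison.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# No columns argument: pandas uses the union of all dictionary keys
df_all = pd.DataFrame(data, index=['first', 'second'])

# 'b1' is not a key in any dictionary, so that column is all NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

print(df_all)
print(df2)
```

The first dictionary has no 'c' key, so df_all holds NaN at row 'first', column 'c'.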
Creating DataFrame using zip() function.
• Two lists can be merged into a list of tuples with list(zip()). The pandas DataFrame is then created by calling pd.DataFrame() on that list.

import pandas as pd

# List 1
Name = ['tom', 'krish', 'nick', 'juli']
# List 2
Age = [25, 30, 26, 22]

# Merge the two lists into a list of tuples using zip()
list_of_tuples = list(zip(Name, Age))
print(list_of_tuples)

# Convert the list of tuples into a pandas DataFrame
df = pd.DataFrame(list_of_tuples, columns=['Name', 'Age'])

# Print the data
print(df)
Get the maximum value from the DataFrame

# df here is the cars DataFrame created earlier, which has a 'Price' column
max1 = df['Price'].max()
1. Group the unique values from the Team column
2. Now there’s a bucket for each group
3. Toss the other data into the buckets
4. Apply a function on the weight column of each bucket.
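The four steps above can be sketched in a few lines. Only the column names Team and weight come from the steps; the rows, the team names, and the choice of the mean as the applied function are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names 'Team' and 'weight' come from the steps above
players = pd.DataFrame({
    'Team': ['Red', 'Blue', 'Red', 'Blue', 'Green'],
    'weight': [80, 72, 90, 68, 75],
})

# Steps 1-3: one bucket per unique Team value, with each row placed in its bucket
buckets = players.groupby('Team')

# Step 4: apply a function (here, the mean) to the weight column of each bucket
mean_weight = buckets['weight'].mean()
print(mean_weight)
```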
Grouping data with one key:
• To group data with one key, we pass a single key as the argument to the groupby function.
import pandas as pd

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                  'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32,
                 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                     'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                           'B.Tech', 'B.com', 'Msc', 'MA']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print(df)

# Group by the 'Name' column and show which rows fall into each group
df.groupby('Name')
print(df.groupby('Name').groups)
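To make the groups attribute more concrete, here is a minimal sketch on a shortened, hypothetical version of the employee data (four rows, two columns). It shows that each group key maps to the index labels of the rows in that group.

```python
import pandas as pd

# Shortened, illustrative version of the employee data
df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi'],
                   'Age': [27, 24, 22, 32]})

groups = df.groupby('Name').groups

# Each key maps to the index labels of the rows in that group
for name, row_labels in groups.items():
    print(name, list(row_labels))
```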
Grouping data with multiple keys:
• To group data with multiple keys, we pass a list of keys to the groupby function.

df.groupby(['Name', 'Qualification'])

print(df.groupby(['Name', 'Qualification']).groups)

• Group keys are sorted by default in the groupby operation. You can pass sort=False for potential speedups.

# Using groupby with the default sorting of group keys
df.groupby(['Name']).sum()

# Using groupby without sorting the group keys
df.groupby(['Name'], sort=False).sum()
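The effect of sort=False is easiest to see in the order of the result's index. The sketch below uses a shortened, hypothetical version of the employee data and sums only the Age column.

```python
import pandas as pd

# Shortened, illustrative version of the employee data
df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav'],
                   'Age': [27, 24, 22, 32, 33]})

# Default: group keys come back in sorted order
sorted_sum = df.groupby('Name')['Age'].sum()

# sort=False: group keys come back in order of first appearance
unsorted_sum = df.groupby('Name', sort=False)['Age'].sum()

print(list(sorted_sum.index))
print(list(unsorted_sum.index))
```

The sums are the same in both cases; only the order of the group keys differs.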
Iteration in groups.
• Iterating over a GroupBy object yields (name, group) pairs, much like itertools.groupby.
grp = df.groupby('Name')
for name, group in grp:
    print(name)
    print(group)
    print()
Now we iterate over a group object built from multiple keys
# Iterating over the groups
# formed from
# multiple keys

grp = df.groupby(['Name', 'Qualification'])

for name, group in grp:
    print(name)
    print(group)
    print()
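When grouping by a list of keys, the name yielded for each group is a tuple of key values. The sketch below shows this on a three-row, hypothetical subset of the employee data. The sizes dictionary is an illustrative helper, not part of the slides.

```python
import pandas as pd

# Three-row, illustrative subset of the employee data
df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai'],
                   'Qualification': ['Msc', 'MA', 'Msc'],
                   'Age': [27, 24, 22]})

# With a list of keys, each group name is a tuple of key values
sizes = {}
for name, group in df.groupby(['Name', 'Qualification']):
    sizes[name] = len(group)

print(sizes)
```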
To select a single group, we can use GroupBy.get_group().

• The GroupBy.get_group function selects a single group by its key.

# Selecting a single group
grp = df.groupby('Name')
grp.get_group('Jai')
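As a quick check of what get_group returns, the following sketch uses a shortened, hypothetical version of the employee data and inspects the rows selected for 'Jai'.

```python
import pandas as pd

# Shortened, illustrative version of the employee data
df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi'],
                   'Age': [27, 24, 22, 32]})

# get_group returns the rows of the original DataFrame that belong to one group
jai = df.groupby('Name').get_group('Jai')
print(jai)
```

The result keeps the original index labels of the selected rows.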
Aggregation
• Aggregation is a process in which we compute a summary statistic about each group.
• An aggregation function returns a single aggregated value for each group.
• After splitting the data into groups using the groupby function, several aggregation operations can be performed on the grouped data.
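An aggregation does not have to be a built-in statistic. The sketch below passes a user-defined function to agg. The function age_range and the four-row data set are hypothetical and chosen only to show that one value comes back per group.

```python
import pandas as pd

# Small, illustrative data set
df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Anuj'],
                   'Age': [27, 24, 22, 36]})

def age_range(ages):
    """Hypothetical user-defined aggregation: oldest minus youngest."""
    return ages.max() - ages.min()

# One aggregated value is returned for each group
result = df.groupby('Name')['Age'].agg(age_range)
print(result)
```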
Code #1: Using aggregation via the aggregate method
# importing pandas module
import pandas as pd

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                  'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32,
                 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
                     'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                           'B.Tech', 'B.com', 'Msc', 'MA']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

print(df)
• Now we perform aggregation using the aggregate method:
# Performing aggregation using the aggregate method.
# The string 'sum' is used because recent pandas versions
# warn when np.sum is passed.
grp1 = df.groupby('Name')

grp1.aggregate('sum')
Now we perform aggregation on a group containing multiple keys
# Performing aggregation on
# a group containing multiple
# keys
grp1 = df.groupby(['Name', 'Qualification'])

grp1.aggregate('sum')
Now we apply multiple functions by passing a list of functions.
# Applying several functions by passing
# a list of function names
grp = df.groupby('Name')

grp['Age'].agg(['sum', 'mean', 'std'])

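Now we apply a different aggregation to each column of a DataFrame.
# Using different aggregation
# functions by passing a dictionary
# to aggregate.
# The 'Score' column is added here because the DataFrame above does not have one.
df['Score'] = [23, 34, 35, 45, 47, 50, 52, 53]
grp = df.groupby('Name')

grp.agg({'Age': 'sum', 'Score': 'std'})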

Transformation :

• Transformation is a process in which we perform some group-specific computations and return a like-indexed object, that is, one with the same index as the original data.
# importing pandas module
import pandas as pd

# importing numpy as np
import numpy as np

# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
         'Age': [27, 24, 22, 32, 33, 36, 27, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj', 'Jaunpur', 'Kanpur',
                     'Allahabad', 'Aligarh'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA'],
         'Score': [23, 34, 35, 45, 47, 50, 52, 53]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)

print(df)
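Now we perform some group-specific computations and return a like-indexed object

# Using the transform function on the numeric columns only;
# the arithmetic in sc is not defined for the string columns
grp = df.groupby('Name')
sc = lambda x: (x - x.mean()) / x.std() * 10
grp[['Age', 'Score']].transform(sc)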
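The phrase "like-indexed" means that transform returns one value per original row rather than one per group. The sketch below illustrates this on a four-row, hypothetical data set. The column name Score_minus_mean is made up for the example.

```python
import pandas as pd

# Small, illustrative data set
df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Anuj'],
                   'Score': [23, 34, 35, 50]})

# transform broadcasts each group's mean back to every row of that group
group_mean = df.groupby('Name')['Score'].transform('mean')

# Because the result shares the original index, it can be combined row by row
df['Score_minus_mean'] = df['Score'] - group_mean
print(df)
```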
Filtration
• Filtration is a process in which we discard some groups according to a group-wise computation that evaluates to True or False. To filter groups, we use the filter method with a condition that decides which groups are kept.
# Filtering data using the filter method:
# keep only the names that appear at least twice
grp = df.groupby('Name')
grp.filter(lambda x: len(x) >= 2)
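To see which rows survive the filter, the sketch below applies the same condition to the Name and Age columns of the employee data. The variable name kept is hypothetical.

```python
import pandas as pd

# Name and Age columns of the employee data
df = pd.DataFrame({'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
                            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
                   'Age': [27, 24, 22, 32, 33, 36, 27, 32]})

# Groups with fewer than two rows ('Gaurav' and 'Abhi') are discarded as a whole
kept = df.groupby('Name').filter(lambda x: len(x) >= 2)
print(kept)
```

The surviving rows keep their original order and index labels.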
