Group-by Python Code

The document discusses the 'group by' operation in data analysis using pandas, highlighting its importance for conditional aggregation of datasets. It provides examples of grouping data by specific keys, performing aggregations, and analyzing datasets such as NYC flight data and phone call records. The document also includes code snippets demonstrating various group by operations and their results.



Group By: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally
on some label or index: this is implemented in the so-called groupby operation. The name "group by" comes
from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms
first coined by Hadley Wickham of Rstats fame: split, apply, combine.
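
Spelled out by hand, the three steps look like this. This is a minimal sketch on a toy frame, not part of the original notebook:

import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'data': [1, 2, 3, 4]})

# split: one sub-frame per key value
pieces = {k: g for k, g in toy.groupby('key')}

# apply: aggregate each piece independently
sums = {k: g['data'].sum() for k, g in pieces.items()}

# combine: reassemble the per-group results into a single object
pd.Series(sums)   # A 4, B 6 -- what toy.groupby('key')['data'].sum() does in one call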

In [1]:

import numpy as np
import pandas as pd

In [8]:

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'B'],
                   'data': range(7)}, columns=['key', 'data'])
df

Out[8]:

  key  data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5
6   B     6

In [4]:

df.groupby('key')

Out[4]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000010CC3FF41C8>

Notice that what is returned is not a set of DataFrames, but a DataFrameGroupBy object. This object is where
the magic is: you can think of it as a special view of the DataFrame, which is poised to dig into the groups but
does no actual computation until the aggregation is applied.
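
For instance, a single group can be pulled out of the lazy object, and the deferred work only runs once an aggregation is requested. A short sketch using standard pandas methods, not part of the original notebook:

df.groupby('key').get_group('A')                       # just the rows where key == 'A'
df.groupby('key')['data'].aggregate(['sum', 'mean'])   # computation happens only here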

In [9]:

df.groupby('key').sum()

Out[9]:

     data
key
A       3
B      11
C       7

In [13]:

for (key, group) in df.groupby('key'):
    print("{0:20s} shape={1}".format(key, group.shape))

A shape=(2, 2)
B shape=(3, 2)
C shape=(2, 2)

In [15]:

nyc_flight = pd.read_csv('data/nyc_flights_2013.csv')

In [27]:

nyc_flight.head(3)

Out[27]:

   year  month  day  dep_time  dep_delay  arr_time  arr_delay carrier tailnum  flight origin
0  2013      1    1     517.0        2.0     830.0       11.0      UA  N14228    1545    EWR
1  2013      1    1     533.0        4.0     850.0       20.0      UA  N24211    1714    LGA
2  2013      1    1     542.0        2.0     923.0       33.0      AA  N619AA    1141    JFK

The pandas expression in the next cell is the equivalent of this SQL query:

SELECT carrier, count(*)
FROM df
GROUP BY carrier

In [26]:

n_by_flight = nyc_flight.groupby("carrier")["carrier"].count()
n_by_flight

Out[26]:

carrier
9E 2982
AA 5094
AS 115
B6 8144
DL 7160
EV 8298
F9 107
FL 527
HA 50
MQ 4137
OO 1
UA 8917
US 3152
VX 716
WN 1915
YV 103
Name: carrier, dtype: int64
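
The same per-carrier counts can be obtained without an explicit groupby; an equivalent one-liner, not from the original notebook:

nyc_flight['carrier'].value_counts()   # same counts, sorted from most to least frequent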

In [3]:

data = pd.read_csv('data/phone_data.csv')
data.head(3)

Out[3]:

   index            date  duration  item    month   network network_type
0      0  15/10/14 06:58    34.429  data  2014-11      data         data
1      1  15/10/14 06:58    13.000  call  2014-11  Vodafone       mobile
2      2  15/10/14 14:46    23.000  call  2014-11    Meteor       mobile

In [30]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index 830 non-null int64
date 830 non-null object
duration 830 non-null float64
item 830 non-null object
month 830 non-null object
network 830 non-null object
network_type 830 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 45.5+ KB

In [4]:

import dateutil

# Convert 'date' from string to datetime.
# The dates are written day-first (e.g. '15/10/14 06:58'), so dayfirst=True avoids misparsing.
# data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
data['date'] = pd.to_datetime(data['date'], dayfirst=True)

In [33]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index 830 non-null int64
date 830 non-null datetime64[ns]
duration 830 non-null float64
item 830 non-null object
month 830 non-null object
network 830 non-null object
network_type 830 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 45.5+ KB

1. How many rows does the dataset have?

In [35]:

data['item'].count()

Out[35]:

830
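
Counting a single column counts only its non-null entries; since 'item' has no nulls here, the plain row count gives the same answer. A sketch, not from the notebook:

len(data)        # 830
data.shape[0]    # 830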

2. What was the longest phone call / data entry?

In [36]:

data['duration'].max()

Out[36]:

10528.0
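
To see the whole record behind that maximum rather than just the value, idxmax (standard pandas, not used in the notebook) returns the index label of the largest entry:

data.loc[data['duration'].idxmax()]   # the row holding the 10528-second entry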

3. How many seconds of phone calls are recorded in total?

In [7]:

data[data['item']=='call']['duration'].sum()

Out[7]:

92321.0

4. How many entries are there for each month?

In [38]:

data['month'].value_counts()

Out[38]:

2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
Name: month, dtype: int64

5. Number of non-null unique network entries

In [39]:

data['network'].nunique()

Out[39]:

6. Get the first entry for each month

In [40]:

data.groupby('month').first()

Out[40]:

         index                date  duration  item   network network_type
month
2014-11      0 2014-10-15 06:58:00    34.429  data      data         data
2014-12    228 2014-11-13 06:58:00    34.429  data      data         data
2015-01    381 2014-12-13 06:58:00    34.429  data      data         data
2015-02    577 2015-01-13 06:58:00    34.429  data      data         data
2015-03    729 2015-02-12 20:15:00    69.000  call  landline     landline

7. Get the sum of the durations per month

In [41]:

data.groupby('month')['duration'].sum()

Out[41]:

month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64

8. Get the number of dates / entries in each month

In [42]:

data.groupby('month')['date'].count()

Out[42]:

month
2014-11 230
2014-12 157
2015-01 205
2015-02 137
2015-03 101
Name: date, dtype: int64

9. What is the sum of durations, for calls only, to each network?

In [43]:

data[data['item'] == 'call'].groupby('network')['duration'].sum()

Out[43]:

network
Meteor 7200.0
Tesco 13828.0
Three 36464.0
Vodafone 14621.0
landline 18433.0
voicemail 1775.0
Name: duration, dtype: float64
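
Several statistics can be collected in one pass with agg; a sketch using standard pandas, not part of the original notebook:

data[data['item'] == 'call'].groupby('network')['duration'].agg(['sum', 'mean', 'count'])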

10. How many calls, sms, and data entries are in each month?

In [44]:

data.groupby(['month', 'item'])['date'].count()

Out[44]:

month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
2015-02 call 67
data 31
sms 39
2015-03 call 47
data 29
sms 25
Name: date, dtype: int64
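
The multi-indexed result above can be reshaped into a month-by-item table with unstack (standard pandas; not in the original notebook):

data.groupby(['month', 'item'])['date'].count().unstack()   # rows: month, columns: item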

11. How many calls, texts, and data are sent per month, split by network_type?

In [45]:

data.groupby(['month', 'network_type'])['date'].count()

Out[45]:

month network_type
2014-11 data 29
landline 5
mobile 189
special 1
voicemail 6
2014-12 data 30
landline 7
mobile 108
voicemail 8
world 4
2015-01 data 31
landline 11
mobile 160
voicemail 3
2015-02 data 31
landline 8
mobile 90
special 2
voicemail 6
2015-03 data 29
landline 11
mobile 54
voicemail 4
world 3
Name: date, dtype: int64

In [48]:

# A single column label produces a pandas Series:
# data.groupby('month')['duration'].sum()

# A list of column labels produces a pandas DataFrame:
data.groupby('month')[['duration']].sum()

Out[48]:

          duration
month
2014-11  26639.441
2014-12  14641.870
2015-01  18223.299
2015-02  15522.299
2015-03  22750.441
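
A quick way to confirm the difference between the single- and double-bracket forms; a sketch, not in the notebook:

type(data.groupby('month')['duration'].sum())     # pandas Series
type(data.groupby('month')[['duration']].sum())   # pandas DataFrame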

In [1]:

import pandas as pd

In [6]:

name = pd.Series(['ram','shyam','kiran','rishi'])

In [7]:

name

Out[7]:

0 ram
1 shyam
2 kiran
3 rishi
dtype: object

In [9]:

mark1 = pd.Series([45, 67, 32, 65])

In [10]:

mark2 = pd.Series([77, 34, 72, 55])

In [13]:

df=pd.DataFrame({name, mark1, mark2})

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-f00c768bf164> in <module>
----> 1 df=pd.DataFrame({name, mark1, mark2})

~\Anaconda3\lib\site-packages\pandas\core\generic.py in __hash__(self)
1884 raise TypeError(
1885 "{0!r} objects are mutable, thus they cannot be"
-> 1886 " hashed".format(self.__class__.__name__)
1887 )
1888

TypeError: 'Series' objects are mutable, thus they cannot be hashed
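
The curly braces build a Python set of the three Series, and pandas Series are unhashable, hence the TypeError. Passing a dict or a list instead works; the list form stacks the Series as rows, matching the DataFrame shown below. A sketch, not from the original notebook:

pd.DataFrame({'name': name, 'mark1': mark1, 'mark2': mark2})   # each Series becomes a column
pd.DataFrame([name, mark1, mark2])                             # each Series becomes a row, as in the output below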

In [12]:

df

Out[12]:

     0      1      2      3
0  ram  shyam  kiran  rishi
1   45     67     32     65
2   77     34     72     55

In [ ]:
