Group - by Python Code
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally
on some label or index: this is implemented in the so-called groupby operation. The name "group by" comes
from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms
first coined by Hadley Wickham of Rstats fame: split, apply, combine.
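To make the three steps concrete, here is a minimal sketch that carries them out by hand and compares the result with a single groupby call. The frame named toy and its values are illustrative assumptions, not data from this notebook.

```python
import pandas as pd

# Illustrative toy data (an assumption, not one of the notebook's datasets)
toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'], 'data': [1, 2, 3, 4]})

# Split: one sub-frame per distinct key
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(sums, name='data').rename_axis('key')

# groupby performs the same three steps in one call
grouped = toy.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

The point of groupby is that the intermediate pieces never need to be built explicitly.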
In [1]:
import numpy as np
import pandas as pd
In [8]:
Out[8]:
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
6 B 6
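The cell that builds df is not shown in this export. A minimal sketch that reproduces the table above, with the values read directly off the printed output (the exact constructor call used originally is an assumption):

```python
import pandas as pd

# Assumed reconstruction: keys and values copied from the printed table
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'B'],
                   'data': range(7)})
print(df)
```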
In [4]:
df.groupby('key')
Out[4]:
Notice that what is returned is not a set of DataFrames, but a DataFrameGroupBy object. This object is where
the magic is: you can think of it as a special view of the DataFrame, which is poised to dig into the groups but
does no actual computation until the aggregation is applied.
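A short sketch of that deferred behaviour, assuming a frame with the same contents as the table printed under Out[8]. Calling groupby only records how to split the rows; nothing is summed or averaged until an aggregation method is called.

```python
import pandas as pd

# Assumed to match the frame printed under Out[8]
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'B'],
                   'data': range(7)})

grouped = df.groupby('key')        # no aggregation has run yet
kind = type(grouped).__name__      # name of the returned object's class
n_groups = grouped.ngroups         # number of distinct keys

# Only this call computes a per-group result
means = grouped['data'].mean()

print(kind, n_groups)
print(means)
```

The same grouped object can be reused for several different aggregations without regrouping.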
In [9]:
df.groupby('key').sum()
Out[9]:
data
key
A 3
B 11
C 7
In [13]:
A shape=(2, 2)
B shape=(3, 2)
C shape=(2, 2)
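The code for this cell is missing from the export. Output of this form is what iterating over the GroupBy object would produce, since each iteration yields a group key together with the sub-frame for that key. A sketch under that assumption, using a frame with the contents shown under Out[8]:

```python
import pandas as pd

# Assumed to match the frame printed under Out[8]
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'B'],
                   'data': range(7)})

shapes = {}
# Each iteration yields (group key, DataFrame holding that group's rows)
for key, group in df.groupby('key'):
    shapes[key] = group.shape
    print(f"{key} shape={group.shape}")
```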
In [15]:
nyc_flight = pd.read_csv('data/nyc_flights_2013.csv')
In [27]:
nyc_flight.head(3)
Out[27]:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin
In [26]:
n_by_flight = nyc_flight.groupby("carrier")["carrier"].count()
n_by_flight
Out[26]:
carrier
9E 2982
AA 5094
AS 115
B6 8144
DL 7160
EV 8298
F9 107
FL 527
HA 50
MQ 4137
OO 1
UA 8917
US 3152
VX 716
WN 1915
YV 103
Name: carrier, dtype: int64
In [3]:
data = pd.read_csv('data/phone_data.csv')
data.head(3)
Out[3]:
In [30]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index 830 non-null int64
date 830 non-null object
duration 830 non-null float64
item 830 non-null object
month 830 non-null object
network 830 non-null object
network_type 830 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 45.5+ KB
In [4]:
import dateutil
In [33]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
index 830 non-null int64
date 830 non-null datetime64[ns]
duration 830 non-null float64
item 830 non-null object
month 830 non-null object
network 830 non-null object
network_type 830 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 45.5+ KB
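The two data.info() listings differ only in the dtype of date (object before, datetime64[ns] after), so the cell in between evidently parsed that column into datetimes, but its code is not shown. Here is one way to do such a conversion, sketched on hypothetical rows. The sample values and the day-first format string are assumptions about the file, not facts taken from it.

```python
import pandas as pd

# Hypothetical sample rows; the real file's date format is an assumption
sample = pd.DataFrame({'date': ['15/10/14 06:58', '01/11/14 15:13'],
                       'duration': [34.429, 13.0]})

# Parse the strings into datetime64 values using an explicit day-first format
sample['date'] = pd.to_datetime(sample['date'], format='%d/%m/%y %H:%M')

print(sample.dtypes)
```

Since the notebook imports dateutil, the original may instead have applied dateutil.parser.parse to each value with dayfirst=True; either route should leave the column as datetime64[ns].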
In [35]:
data['item'].count()
Out[35]:
830
In [36]:
data['duration'].max()
Out[36]:
10528.0
In [7]:
data[data['item']=='call']['duration'].sum()
Out[7]:
92321.0
In [38]:
data['month'].value_counts()
Out[38]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
Name: month, dtype: int64
In [39]:
data['network'].nunique()
Out[39]:
In [40]:
data.groupby('month').first()
Out[40]:
month
In [41]:
data.groupby('month')['duration'].sum()
Out[41]:
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
Name: duration, dtype: float64
In [42]:
data.groupby('month')['date'].count()
Out[42]:
month
2014-11 230
2014-12 157
2015-01 205
2015-02 137
2015-03 101
Name: date, dtype: int64
In [43]:
data[data['item'] == 'call'].groupby('network')['duration'].sum()
Out[43]:
network
Meteor 7200.0
Tesco 13828.0
Three 36464.0
Vodafone 14621.0
landline 18433.0
voicemail 1775.0
Name: duration, dtype: float64
10. How many calls, sms, and data entries are in each month?
In [44]:
data.groupby(['month', 'item'])['date'].count()
Out[44]:
month item
2014-11 call 107
data 29
sms 94
2014-12 call 79
data 30
sms 48
2015-01 call 88
data 31
sms 86
2015-02 call 67
data 31
sms 39
2015-03 call 47
data 29
sms 25
Name: date, dtype: int64
11. How many calls, texts, and data are sent per month, split by network_type?
In [45]:
data.groupby(['month', 'network_type'])['date'].count()
Out[45]:
month network_type
2014-11 data 29
landline 5
mobile 189
special 1
voicemail 6
2014-12 data 30
landline 7
mobile 108
voicemail 8
world 4
2015-01 data 31
landline 11
mobile 160
voicemail 3
2015-02 data 31
landline 8
mobile 90
special 2
voicemail 6
2015-03 data 29
landline 11
mobile 54
voicemail 4
world 3
Name: date, dtype: int64
In [48]:
Out[48]:
duration
month
2014-11 26639.441
2014-12 14641.870
2015-01 18223.299
2015-02 15522.299
2015-03 22750.441
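The table above carries the same totals as Out[41], but it is a DataFrame with a duration column rather than a Series. One call that yields this shape is agg with a mapping from column name to function; whether the missing cell used exactly this call is an assumption. The sketch uses made-up values in a frame named toy.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({'month': ['2014-11', '2014-11', '2014-12'],
                    'duration': [10.0, 5.0, 2.5]})

# Selecting one column before aggregating returns a Series
as_series = toy.groupby('month')['duration'].sum()

# Passing a {column: function} mapping to agg returns a DataFrame
as_frame = toy.groupby('month').agg({'duration': 'sum'})

print(as_series)
print(as_frame)
```

The mapping form becomes more useful once several columns need different aggregations in one pass.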
In [1]:
import pandas as pd
In [6]:
name = pd.Series(['ram','shyam','kiran','rishi'])
In [7]:
name
Out[7]:
0 ram
1 shyam
2 kiran
3 rishi
dtype: object
In [9]:
In [10]:
In [13]:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-f00c768bf164> in <module>
----> 1 df=pd.DataFrame({name, mark1, mark2})
~\Anaconda3\lib\site-packages\pandas\core\generic.py in __hash__(self)
1884 raise TypeError(
1885 "{0!r} objects are mutable, thus they cannot be"
-> 1886 " hashed".format(self.__class__.__name__)
1887 )
1888
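The TypeError arises because {name, mark1, mark2} is a set literal: Python tries to hash each Series to put it in the set, and Series objects are not hashable. What was probably intended is a dict mapping column labels to the Series. In the sketch below, the definitions of mark1 and mark2 are assumptions (their cells are empty in this export; the values are borrowed from the rows of the Out[12] table), and the column labels are illustrative.

```python
import pandas as pd

name = pd.Series(['ram', 'shyam', 'kiran', 'rishi'])
# Assumed values: mark1 and mark2 are not defined in the visible cells
mark1 = pd.Series([45, 67, 32, 65])
mark2 = pd.Series([77, 34, 72, 55])

# A set literal forces each Series to be hashed, which fails
try:
    pd.DataFrame({name, mark1, mark2})
    failed = False
except TypeError:
    failed = True

# A dict of label -> Series builds one column per Series
df = pd.DataFrame({'name': name, 'mark1': mark1, 'mark2': mark2})
print(df)
```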
In [12]:
df
Out[12]:
0 1 2 3
1 45 67 32 65
2 77 34 72 55