
Basic Data Processing with Pandas

Pandas is a Python library used for working with data sets.


It has functions for analysing, cleaning, exploring, and manipulating data.

Why Use Pandas?


Pandas allows us to analyse big data and draw conclusions based on statistical methods.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.

Pandas Series
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=None,
fastpath=_NoDefault.no_default): One-dimensional ndarray with axis labels
(including time series).

Constructing Series from a dictionary


>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d)
>>> ser
a 1
b 2
c 3
dtype: int64

If the passed index labels do not match the dictionary keys, the values are looked up
by label and the unmatched labels get NaN:

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64

Constructing Series from a list


>>> d = [10, 20, 30]
>>> ser = pd.Series(data=d)
>>> ser
0    10
1    20
2    30
dtype: int64

Constructing Series from a scalar


>>> ser = pd.Series(50, index=['a', 'b', 'c'])
>>> ser
a    50
b    50
c    50
dtype: int64

Indexing and Selecting Data


 Positional Indexing: Accessing data by position using integer-based indexing (like lists).
 Label-based Indexing: Accessing data by index labels.
 Boolean Indexing: Selecting data based on a condition

s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
# Positional indexing
print(s.iloc[0]) # Output: 4

# Label-based indexing
print(s['b']) # Output: 7

# Boolean indexing
print(s[s > 0]) # Output: a    4
                #         b    7
                #         d    3
Vectorised Operations
Pandas Series supports vectorised operations, which means operations are applied to each
element of the Series without the need for an explicit loop.
s = pd.Series([1, 2, 3, 4])
# Arithmetic operations
print(s + 5)  # Output: Series with each element incremented by 5
# Element-wise operations
print(s * 2)  # Output: Series with each element multiplied by 2

Alignment of Data
When performing operations between two Series, Pandas automatically aligns the data based
on the index. If an index is missing, the result will have NaN for that index.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding Series
result = s1 + s2
print(result)
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
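
If NaN for non-overlapping labels is not wanted, the arithmetic methods (.add(), .sub(),
.mul(), .div()) accept a fill_value argument that substitutes a value for the missing side
before the operation. A minimal sketch using the same s1 and s2:

result = s1.add(s2, fill_value=0)  # missing side treated as 0
print(result)
a    1.0
b    6.0
c    8.0
d    6.0
dtype: float64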

Handling Missing Data

Pandas Series has built-in methods to handle missing data (NaN values).

 .isnull(): Returns a Boolean Series indicating if each value is NaN.


 .notnull(): Returns the inverse of .isnull().
 .fillna(): Fills missing data with a specified value.
 .dropna(): Removes missing data from the Series.

s = pd.Series([1, None, 3, None, 5])

# Detect missing values


print(s.isnull()) # Output: Boolean Series
0    False
1     True
2    False
3     True
4    False
dtype: bool
print(s.notnull()) # Output: Boolean Series
0     True
1    False
2     True
3    False
4     True
dtype: bool

# Fill missing values


s_filled = s.fillna(0)
print(s_filled) # Output: Series with `None` replaced by 0
0    1.0
1    0.0
2    3.0
3    0.0
4    5.0
dtype: float64

# Drop missing values


s_dropped = s.dropna()
print(s_dropped) # Output: Series with missing values removed
0    1.0
2    3.0
4    5.0
dtype: float64

Series Methods

Series objects come with a variety of methods for data manipulation and analysis:

 .sum(): Returns the sum of all elements.


 .mean(): Returns the mean of the elements.
 .sort_values(): Sorts the Series by its values.
 .rank(): Ranks the values in the Series.
 .apply(): Applies a function to each element.

s = pd.Series([5, 3, 8, 2])
print(s.sum()) # Output: 18
print(s.mean()) # Output: 4.5
print(s.sort_values()) # Output: Sorted Series
3    2
1    3
0    5
2    8
dtype: int64
print(s.rank()) # Output: Series with ranks
0    3.0
1    2.0
2    4.0
3    1.0
dtype: float64
print(s.apply(lambda x: x**2)) # Output: Series with squared values
0    25
1     9
2    64
3     4
dtype: int64

Series and Time Series Data:

Pandas Series is well-suited for handling time series data. You can use date ranges as the
index to create a time series.

pandas.date_range(start=None, end=None, periods=None)

pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
'2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
dtype='datetime64[ns]', freq='D')

Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
'2018-04-27 00:00:00'],
dtype='datetime64[ns]', freq=None)

dates = pd.date_range('20230101', periods=6)

s = pd.Series([1, 3, 5, 7, 9, 11], index=dates)

print(s)

2023-01-01     1
2023-01-02     3
2023-01-03     5
2023-01-04     7
2023-01-05     9
2023-01-06    11
Freq: D, dtype: int64
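
With a DatetimeIndex, a Series can also be sliced with date strings (partial string
indexing). A small sketch using the s defined above; as with all label slicing, both
endpoints are included:

print(s['2023-01-02':'2023-01-04'])
2023-01-02    3
2023-01-03    5
2023-01-04    7
Freq: D, dtype: int64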

Querying a Series
Querying a Series in Pandas involves selecting and filtering data based on various conditions.
Accessing Elements by Index: You can access elements in a Series using the index, either
by position (integer-based) or by label.
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Access the first element by position
print(s.iloc[0]) # Output: 10
# Access the third element by position
print(s.iloc[2]) # Output: 30

# Access element with index label 'b'


print(s['b']) # Output: 20
# Access multiple elements by label
print(s[['b', 'd']]) # Output: b    20
                     #         d    40
Slicing: You can slice a Series to get a subset of the data. Slicing can be done by position or
by label.
# Slice the first three elements by position
print(s[:3]) # Output: a    10
             #         b    20
             #         c    30

# Slice using labels (the end label is included)
print(s['a':'c']) # Output: a    10
                  #         b    20
                  #         c    30

Boolean Indexing: Boolean indexing allows you to filter a Series based on a condition.
# Query elements greater than 20
print(s[s > 20]) # Output: c    30
                 #         d    40

You can combine multiple conditions using logical operators like & (and), | (or), and ~ (not).

# Query elements greater than 10 and less than 40
print(s[(s > 10) & (s < 40)]) # Output: b    20
                              #         c    30

Using isin() Method: The isin() method is used to filter data based on whether the elements
are in a list of values.

# Query elements that are either 10 or 30
print(s[s.isin([10, 30])]) # Output: a    10
                           #         c    30
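
The ~ operator mentioned above negates a Boolean mask; combined with isin() it keeps the
elements not in the list. A small sketch with the same s:

print(s[~s.isin([10, 30])]) # Output: b    20
                            #         d    40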

Querying with .loc and .iloc: .loc[] is used for label-based indexing, while .iloc[] is
used for position-based indexing.
# Using .loc for label-based indexing (end label included)
print(s.loc['b':'d']) # Output: b    20
                      #         c    30
                      #         d    40

# Using .iloc for position-based indexing (end position excluded)
print(s.iloc[1:3]) # Output: b    20
                   #         c    30

Querying with Conditional Functions: You can also use functions like .where() and
.query() for conditional selection.

.where(): Returns elements that satisfy a condition, otherwise returns NaN.

# Keep elements greater than 20, replace others with NaN


print(s.where(s > 20)) # Output: a     NaN
                       #         b     NaN
                       #         c    30.0
                       #         d    40.0

.query(): Though primarily used for DataFrames, it can be applied to a Series by
temporarily converting the Series to a DataFrame.

df = s.to_frame(name='value')
result = df.query('value > 20')
print(result) # Output:    value
              #         c     30
              #         d     40

Handling Missing Data: While querying, you might encounter missing data (NaN).

.isnull() and .notnull(): Check for NaN values.

s = pd.Series([10, None, 30, None, 50])

# Query non-null values
print(s[s.notnull()]) # Output: 0    10.0
                      #         2    30.0
                      #         4    50.0

.dropna(): Removes NaN values from the Series.

# Drop missing data
print(s.dropna()) # Output: 0    10.0
                  #         2    30.0
                  #         4    50.0

Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or
a table with rows and columns.

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None):
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

We can create a Pandas DataFrame in the following ways:


 Using Python Dictionary
 Using Python List
 From a File
 Creating an Empty DataFrame

import pandas as pd
# create a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
# create a dataframe from the dictionary
df = pd.DataFrame(data)
print(df) # Output:
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris

# create a two-dimensional list

data = [['John', 25, 'New York'],

['Alice', 30, 'London'],

['Bob', 35, 'Paris']]

# create a DataFrame from the list

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])

print(df) # Output:
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris

# load data from a CSV file (assumes a file named data.csv in the working directory)

df = pd.read_csv('data.csv')
print(df)

# create an empty DataFrame

df = pd.DataFrame()

print(df) # Output: Empty DataFrame
          #         Columns: []
          #         Index: []

Basic DataFrame Operations


Viewing Data:

DataFrame.head(n=5): Return the first n rows.

For negative values of n, this function returns all rows except the last |n| rows.
If n is larger than the number of rows, this function returns all rows.

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra

>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey

>>> df.head(3)
animal
0 alligator
1 bee
2 falcon

>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot

DataFrame.tail(n=5): Return the last n rows.

For negative values of n, this function returns all rows except the first n rows.
If n is larger than the number of rows, this function returns all rows.
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra

>>> df.tail(3)
animal
6 shark
7 whale
8 zebra

>>> df.tail(-3)
animal
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra

DataFrame.info(): Print a concise summary of a DataFrame. This method prints information
about a DataFrame including the index dtype and columns, non-null values and memory usage.

>>> df = pd.DataFrame({'int_col': [1, 2, 3, 4, 5],
...                    'text_col': ['alpha', 'beta', 'gamma', 'delta', 'epsilon'],
...                    'float_col': [0.0, 0.25, 0.5, 0.75, 1.0]})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int_col 5 non-null int64
1 text_col 5 non-null object
2 float_col 5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Accessing Columns and Rows


Single column: df['column_name'] or df.column_name
Multiple columns: df[['column1', 'column2']]
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

# Convert the dictionary into DataFrame


df = pd.DataFrame(data)

# select two columns


print(df[['Name', 'Qualification']])
# Output:
     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
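
A single column can be selected with either bracket or attribute notation (attribute
notation only works when the column name is a valid Python identifier). A small sketch
with the same df:

print(df['Name'])  # bracket notation
print(df.Age)      # attribute notation, equivalent to df['Age']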

Column addition: In order to add a column to a Pandas DataFrame, we can declare a new
list as a column and add it to an existing DataFrame.
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Height': [5.1, 6.2, 5.1, 5.2],
'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}

# Convert the dictionary into DataFrame


df = pd.DataFrame(data)

# Declare a list that is to be converted into a column


address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']

# Using 'Address' as the column name


# and equating it to the list
df['Address'] = address

# Observe the result
print(df)
# Output:
     Name  Height Qualification    Address
0     Jai     5.1           Msc      Delhi
1  Princi     6.2            MA  Bangalore
2  Gaurav     5.1           Msc    Chennai
3    Anuj     5.2           Msc      Patna
Column Deletion: In order to delete a column in a Pandas DataFrame, we can use the
.drop() method.

>>> import numpy as np
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11

>>> df.drop(columns=['B', 'C'])


A D
0 0 3
1 4 7
2 8 11

Drop a row by index

>>> df.drop([0, 1])


A B C D
2 8 9 10 11

Selecting Rows

By label: df.loc[label]: Label-based data selector. The end index is included during
slicing.

By index: df.iloc[index]: Index-based data selector. The end index is excluded during
slicing.
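
A quick sketch of the inclusive/exclusive difference, using a Series with integer labels
(a hypothetical example, not part of the sample data below):

s = pd.Series(['w', 'x', 'y', 'z'], index=[10, 11, 12, 13])
print(s.loc[10:12])  # labels 10, 11, 12 -- the end label is included
print(s.iloc[0:2])   # positions 0 and 1 -- the end position is excluded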

What is loc Method?


The loc[ ] is a label-based method used for selecting data as well as updating it. This is done
by passing the name (label) of the row or column that we wish to select.

Syntax: loc[row_label, column_label]

#Creating a Sample DataFrame
df = pd.DataFrame({
    'id': [101, 102, 103, 104, 105, 106, 107],
    'age': [20, 22, 23, 21, 22, 21, 25],
    'group': ['A', 'B', 'C', 'C', 'B', 'A', 'A'],
    'city': ['Tier1', 'Tier2', 'Tier2', 'Tier3', 'Tier1', 'Tier2', 'Tier1'],
    'gender': ['M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'degree': ['econ', 'econ', 'maths', 'finance', 'history', 'science', 'marketing']
})
#Use id as the row labels so that loc[] can select rows by id
df = df.set_index('id')

Selecting a row using loc[ ]


#Selecting a row with label
df.loc[102]

Slicing using loc[ ]


#Slicing using loc[]
df.loc[101:103]

Filtering rows using loc[ ]

#Selecting all rows with a given condition


df.loc[df.age >= 22]
#Selecting rows with multiple conditions
df.loc[(df.age >= 22) & (df.city == 'Tier1')]

Filtering columns using loc[ ]

#Selecting columns with a given condition


df.loc[(df.gender == 'M'), ['group', 'degree']]

Updating columns using loc[ ]

#Updating a column with a given condition


df.loc[(df.gender == 'M'), ['group']] = 'A'
df
#Updating multiple columns with a given condition
df.loc[(df.gender == 'F'), ['group', 'city']] = ['B','Tier2']
df

What is iloc Method?


The iloc[ ] is an index-based method used for data selection. In this case, we pass the
positions of the row or column that we wish to select (0-based integer index).

Syntax: iloc[row_position, column_position]


Selecting a row using iloc

#Selecting rows by position
df.iloc[[2, 4]]  # positions 2 and 4, i.e. the third and the fifth rows

Selecting rows and columns using iloc

#Selecting particular rows and particular columns
df.iloc[[0, 4], [1, 3]]  # rows at positions 0 and 4, columns at positions 1 and 3

Slicing using iloc

#Selecting a range of rows
df.iloc[1:5]  # position 5, the endpoint, is not included
We can also select a range of rows and columns:
#Selecting a range of rows and columns
df.iloc[1:3, 2:4]

Joining and Merging Dataframes


Pandas provides several functions for this purpose, primarily merge() and concat().

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None,
indicator=False, validate=None): Merge DataFrame or named Series objects with a
database-style join. A named Series object is treated as a DataFrame with a single
named column.

how: {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
Type of merge to be performed.

 left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
 right: use only keys from right frame, similar to a SQL right outer
join; preserve key order.
 outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
 inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys.
 cross: creates the cartesian product from both frames, preserves
the order of the left keys.

on: label or list: Column or index level names to join on. These must be found in
both DataFrames. If on is None and not merging on indexes then this defaults to the
intersection of the columns in both DataFrames.
 left_on: Column(s) from the left DataFrame to use as keys.
 right_on: Column(s) from the right DataFrame to use as keys.
 suffixes: Suffix to apply to overlapping column names in the left and right side.

import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value2': [5, 6, 7, 8]
})
print(df1)
  key  value1
0   A       1
1   B       2
2   C       3
3   D       4
print(df2)
  key  value2
0   B       5
1   D       6
2   E       7
3   F       8
result = pd.merge(df1, df2)
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

The join column can also be named explicitly; with the same df1 and df2 the result is
identical:

result = pd.merge(df1, df2, on='key')
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

result = pd.merge(df1, df2, how='left')
print(result)
  key  value1  value2
0   A       1     NaN
1   B       2     5.0
2   C       3     NaN
3   D       4     6.0

result = pd.merge(df1, df2, how='right')
print(result)
  key  value1  value2
0   B     2.0       5
1   D     4.0       6
2   E     NaN       7
3   F     NaN       8

result = pd.merge(df1, df2, how='outer')
print(result)
  key  value1  value2
0   A     1.0     NaN
1   B     2.0     5.0
2   C     3.0     NaN
3   D     4.0     6.0
4   E     NaN     7.0
5   F     NaN     8.0

result = pd.merge(df1, df2, how='inner')
print(result)
  key  value1  value2
0   B       2       5
1   D       4       6

result = pd.merge(df1, df2, how='cross')
print(result)
   key_x  value1 key_y  value2
0      A       1     B       5
1      A       1     D       6
2      A       1     E       7
3      A       1     F       8
4      B       2     B       5
5      B       2     D       6
6      B       2     E       7
7      B       2     F       8
8      C       3     B       5
9      C       3     D       6
10     C       3     E       7
11     C       3     F       8
12     D       4     B       5
13     D       4     D       6
14     D       4     E       7
15     D       4     F       8

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],


... 'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [5, 6, 7, 8]})
>>> df1.merge(df2, left_on='lkey', right_on='rkey')
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 bar 2 bar 6
3 baz 3 baz 7
4 foo 5 foo 5
5 foo 5 foo 8
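
The suffixes parameter controls how overlapping column names are disambiguated; the
default ('_x', '_y') produced value_x and value_y above. A small sketch overriding it:

>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  bar           2  bar            6
3  baz           3  baz            7
4  foo           5  foo            5
5  foo           5  foo            8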
pandas.concat
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None,
names=None, verify_integrity=False, sort=False, copy=None): Concatenate pandas objects
along a particular axis.

axis: {0/'index', 1/'columns'}, default 0
Axis to concatenate along (0 for rows, 1 for columns).
join: {'inner', 'outer'}, default 'outer'
How to handle indexes on the other axis (or axes).
ignore_index: bool, default False
If True, do not use the index values along the concatenation axis. The resulting axis
will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where
the concatenation axis does not have meaningful indexing information. Note the index
values on the other axes are still respected in the join.
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
print(pd.concat([df1, df2]))
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

print(pd.concat([df1, df2], axis=1))
  letter  number letter  number
0      a       1      c       3
1      b       2      d       4

print(pd.concat([df1, df2], ignore_index=True))
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4
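
The keys parameter from the signature above labels each input frame, producing a
hierarchical (MultiIndex) row index; a small sketch:

print(pd.concat([df1, df2], keys=['first', 'second']))
         letter  number
first  0      a       1
       1      b       2
second 0      c       3
       1      d       4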

Pivot Table
Pivot tables in Pandas are powerful tools for summarizing data, allowing you to aggregate and
reshape your data easily. The pivot_table() function in Pandas is similar to Excel's pivot tables.

pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
               fill_value=None, margins=False, margins_name='All')
 data: The DataFrame to pivot.
 values: Column(s) to aggregate.
 index: Column(s) to group by on the rows.
 columns: Column(s) to group by on the columns.
 aggfunc: Function to aggregate values (e.g., mean, sum, count, etc.). Default is mean.
 fill_value: Value to replace missing values.
 margins: Add all rows/columns (like Excel's pivot table "Grand Totals").
 margins_name: Name of the margins row/column.

import pandas as pd

data = {
    'Region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 300, 400],
    'Quantity': [10, 15, 20, 25, 30, 40]
}

df = pd.DataFrame(data)
print(df)
  Region Product  Sales  Quantity
0   East       A    100        10
1   East       B    150        15
2   West       A    200        20
3   West       B    250        25
4   East       A    300        30
5   West       B    400        40

pivot = pd.pivot_table(df, values='Sales', index='Region', columns='Product',
                       aggfunc='sum')
print(pivot)
Product    A    B
Region
East     400  150
West     200  650

pivot = pd.pivot_table(df, values='Sales', index='Region', columns='Product',
                       aggfunc=[sum, len])
print(pivot)
         sum       len
Product    A    B    A  B
Region
East     400  150    2  1
West     200  650    1  2

pivot = pd.pivot_table(df, values='Sales', index=['Region', 'Product'], aggfunc='sum')
print(pivot)
                Sales
Region Product
East   A          400
       B          150
West   A          200
       B          650

pivot = pd.pivot_table(df, values='Sales', index='Region', columns='Product',
                       aggfunc='sum', margins=True)
print(pivot)
Product    A    B   All
Region
East     400  150   550
West     200  650   850
All      600  800  1400

pivot = pd.pivot_table(df, values=['Sales', 'Quantity'], index='Region',
                       columns='Product', aggfunc='sum')
print(pivot)
        Quantity     Sales
Product        A   B     A    B
Region
East          40  15   400  150
West          20  65   200  650

Aggregation and Grouping


DataFrame.agg(func=None, axis=0): Aggregate using one or more operations over the
specified axis.

func: Function to use for aggregating the data, or a list of functions and/or
function names, e.g. ['sum', 'mean'].

axis: {0 or 'index', 1 or 'columns'}, default 0
If 0 or 'index': apply the function to each column. If 1 or 'columns': apply the
function to each row.

>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

Aggregate these functions over the rows.


>>> df.agg(['sum', 'min'])
A B C
sum 12.0 15.0 18.0
min 1.0 2.0 3.0

Different aggregations per column.

>>> df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})


A B
sum 12.0 NaN
min 1.0 2.0
max NaN 8.0

Aggregate over the columns.

>>>df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
3 NaN

DataFrame.groupby: The groupby() method is used to group data based on one or more
columns, and then you can apply aggregation functions such as mean(), sum(), count(), etc.

Syntax:
df.groupby('column_name').aggregate_function()
Example:
import pandas as pd

# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Employee': ['John', 'Anna', 'Mike', 'Sara', 'Paul', 'Kate'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000]
}

df = pd.DataFrame(data)

# Group by Department and calculate the mean salary for each department
grouped_df = df.groupby('Department')['Salary'].mean()

print(grouped_df)

output:
Department
Finance 70000.0
HR 55000.0
IT 75000.0
Name: Salary, dtype: float64

Common Aggregation Functions:

 .sum(): Sum of values.


 .mean(): Mean of values.
 .count(): Count of non-null values.
 .min(), .max(): Minimum and maximum values.
 .std(), .var(): Standard deviation and variance.
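
For example, applying two of these to the grouped Salary column from the example above:

grouped_df = df.groupby('Department')['Salary'].agg(['count', 'std'])
print(grouped_df)

            count          std
Department
Finance         2  7071.067812
HR              2  7071.067812
IT              2  7071.067812

(Each department has two salaries 10,000 apart, hence the identical standard deviations.)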

Multiple Aggregations:

You can also apply multiple aggregation functions to the groups:

# Applying multiple aggregation functions


grouped_df = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max'])

print(grouped_df)

output:

               mean    min    max
Department
Finance     70000.0  65000  75000
HR          55000.0  50000  60000
IT          75000.0  70000  80000

Grouping by Multiple Columns:

You can group by multiple columns as well:

# Group by Department and Employee


grouped_df = df.groupby(['Department', 'Employee'])['Salary'].sum()

print(grouped_df)

output:

Department  Employee
Finance     Kate        65000
            Paul        75000
HR          Anna        60000
            John        50000
IT          Mike        70000
            Sara        80000
Name: Salary, dtype: int64

Summary tables in pandas


In pandas, you can create summary tables (or summary statistics tables) using several built-in
methods. These tables provide key statistics such as mean, count, sum, etc., for each group or
the entire dataset.

1. Basic Summary Statistics Table: .describe()

The .describe() method provides a summary of statistics like count, mean, standard
deviation, min, and max for numeric columns.

import pandas as pd

# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000],
'Years_Experience': [5, 7, 3, 10, 8, 6]
}
df = pd.DataFrame(data)
# Summary statistics
summary_table = df.describe()
print(summary_table)

output:

             Salary  Years_Experience
count      6.000000          6.000000
mean   66666.666667          6.500000
std    10801.234497          2.428992
min    50000.000000          3.000000
25%    61250.000000          5.250000
50%    67500.000000          6.500000
75%    73750.000000          7.750000
max    80000.000000         10.000000

2. Summary Table by Group: .groupby() + Aggregation

You can use the groupby() function combined with aggregation methods to generate
summary tables based on categorical columns.

Example: Summary by Department

# Group by Department and calculate summary statistics


grouped_summary = df.groupby('Department').agg({
'Salary': ['mean', 'sum', 'min', 'max'],
'Years_Experience': ['mean', 'sum', 'min', 'max']
})

print(grouped_summary)

output:
             Salary                      Years_Experience
               mean     sum    min    max           mean sum min max
Department
Finance     70000.0  140000  65000  75000            7.0  14   6   8
HR          55000.0  110000  50000  60000            6.0  12   5   7
IT          75000.0  150000  70000  80000            6.5  13   3  10
3. Crosstab Summary: pd.crosstab()

pd.crosstab() creates a summary table (cross-tabulation) of counts or aggregations
between two or more categorical columns.

Example: Cross-tabulation of Departments and Years of Experience

# Crosstab of Department and Years_Experience

crosstab_summary = pd.crosstab(df['Department'], df['Years_Experience'])

print(crosstab_summary)

output:

Years_Experience  3   5   6   7   8   10
Department
Finance           0   0   1   0   1    0
HR                0   1   0   1   0    0
IT                1   0   0   0   0    1

In this case, it shows the number of employees in each department with various years of experience.
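
pd.crosstab() also accepts margins=True, which appends row and column totals like the
pivot-table example earlier; a small sketch:

crosstab_summary = pd.crosstab(df['Department'], df['Years_Experience'], margins=True)
print(crosstab_summary)

Years_Experience  3   5   6   7   8   10  All
Department
Finance           0   0   1   0   1    0    2
HR                0   1   0   1   0    0    2
IT                1   0   0   0   0    1    2
All               1   1   1   1   1    1    6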

4. Pivot Tables: pd.pivot_table()

Pivot tables provide more flexibility by allowing you to compute aggregated values across
multiple dimensions.

Example: Pivot Table of Salary by Department and Years of Experience

# Pivot table for Salary with Department and Years_Experience

pivot_summary = pd.pivot_table(df, values='Salary', index='Department',


columns='Years_Experience', aggfunc='mean')

print(pivot_summary)

output:

Years_Experience       3        5        6        7        8        10
Department
Finance              NaN      NaN  65000.0      NaN  75000.0      NaN
HR                   NaN  50000.0      NaN  60000.0      NaN      NaN
IT               70000.0      NaN      NaN      NaN      NaN  80000.0
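
The fill_value parameter listed earlier replaces those NaN cells with a constant; a small
sketch on the same pivot (the NaN cells above become 0):

pivot_summary = pd.pivot_table(df, values='Salary', index='Department',
                               columns='Years_Experience', aggfunc='mean', fill_value=0)
print(pivot_summary)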


5. Custom Summary Table Using .apply()

If you need a highly customized summary, you can use .apply() to create a summary table
with user-defined functions.

Example: Custom Summary for Salary

# Custom function for range (max - min)

def salary_range(x):

return x.max() - x.min()

# Apply custom aggregation to Salary

custom_summary = df.groupby('Department').Salary.apply(salary_range)

print(custom_summary)

output:

Department
Finance    10000
HR         10000
IT         10000
Name: Salary, dtype: int64
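
The same result can be produced without defining a named helper by passing a lambda
directly: df.groupby('Department')['Salary'].apply(lambda x: x.max() - x.min()).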
