Basic Data Processing with Pandas
Pandas Series
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
# Positional indexing
print(s.iloc[0]) # Output: 4
# Label-based indexing
print(s['b']) # Output: 7
# Boolean indexing
print(s[s > 0]) # Output: a 4
# b 7
# d 3
Vectorised Operations
A Pandas Series supports vectorised operations: an operation is applied to each
element of the Series without the need for an explicit loop.
s = pd.Series([1, 2, 3, 4])
# Arithmetic operations
print(s + 5) # Output: Series with each element incremented by 5
# Element-wise operations
print(s * 2) # Output: Series with each element multiplied by 2
Alignment of Data
When performing operations between two Series, Pandas automatically aligns the data based
on the index. If an index label appears in only one of the Series, the result will have NaN
for that label.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
# Adding Series
result = s1 + s2
print(result)
a NaN
b 6.0
c 8.0
d NaN
dtype: float64
Pandas Series has built-in methods to handle missing data (NaN values).
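A minimal sketch of those missing-data methods, reusing the aligned result from the addition above:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
result = s1 + s2  # 'a' and 'd' become NaN after alignment

# Detect missing values
print(result.isna())     # True for 'a' and 'd'

# Replace NaN with a default value
print(result.fillna(0))  # a 0.0, b 6.0, c 8.0, d 0.0

# Drop missing entries entirely
print(result.dropna())   # b 6.0, c 8.0
```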
Series Methods
Series objects come with a variety of methods for data manipulation and analysis:
s = pd.Series([5, 3, 8, 2])
print(s.sum()) # Output: 18
print(s.mean()) # Output: 4.5
print(s.sort_values()) # Output: Sorted Series
3 2
1 3
0 5
2 8
print(s.rank()) # Output: Series with ranks
0 3.0
1 2.0
2 4.0
3 1.0
print(s.apply(lambda x: x**2)) # Output: Series with squared values
0 25
1 9
2 64
3 4
Pandas Series is well-suited for handling time series data. You can use date ranges as the
index to create a time series.
pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
dtype='datetime64[ns]', freq='D')
pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
'2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
dtype='datetime64[ns]', freq='D')
For example, a Series indexed by dates:
s = pd.Series([1, 3, 5, 7, 9, 11], index=pd.date_range('2023-01-01', periods=6))
print(s)
2023-01-01 1
2023-01-02 3
2023-01-03 5
2023-01-04 7
2023-01-05 9
2023-01-06 11
Querying a Series
Querying a Series in Pandas involves selecting and filtering data based on various conditions.
Accessing Elements by Index: You can access elements in a Series using the index, either
by position (integer-based) or by label.
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# Access the first element by position
print(s.iloc[0]) # Output: 10
# Access the third element by position
print(s.iloc[2]) # Output: 30
# Slice the first three elements
print(s.iloc[:3]) # Output: a 10
b 20
c 30
Boolean Indexing: Boolean indexing allows you to filter a Series based on a condition.
# Query elements greater than 20
print(s[s > 20]) # Output: c 30
#                          d 40
You can combine multiple conditions using logical operators like & (and), | (or), and ~ (not).
# Elements greater than 20 AND less than 40
print(s[(s > 20) & (s < 40)]) # Output: c 30
Using isin() Method: The isin() method is used to filter data based on whether the elements
are in a list of values.
print(s[s.isin([30])]) # Output: c 30
Slicing with .loc and .iloc: .loc slices by label (the end label is included); .iloc slices
by position (the end position is excluded).
print(s.loc['b':'d']) # Output: b 20
c 30
d 40
print(s.iloc[1:3]) # Output: b 20
c 30
Querying with Conditional Functions: You can also use functions like .where() and
.query() for conditional selection.
.where(): keeps values that satisfy the condition and replaces the rest with NaN.
print(s.where(s > 20)) # Output: a NaN
#                                b NaN
#                                c 30.0
#                                d 40.0
.query(): Though primarily used for DataFrames, it can be used with a Series by
temporarily converting it to a DataFrame.
df = s.to_frame(name='value')
result = df.query('value > 20')
print(result) # Output:   value
#                      c     30
#                      d     40
Handling Missing Data: While querying, you might encounter missing data (NaN), which you
can remove with .dropna().
print(s.where(s > 20).dropna()) # Output: c 30.0
#                                         d 40.0
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a
table with rows and columns.
import pandas as pd
# create a dictionary
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
# create a dataframe from the dictionary
df = pd.DataFrame(data)
print(df) # Output: Name Age City
0 John 25 New York
1 Alice 30 London
2 Bob 35 Paris
A DataFrame can also be loaded from a CSV file:
df = pd.read_csv('data.csv')
print(df)
Calling the constructor with no arguments creates an empty DataFrame:
df = pd.DataFrame()
print(df) # Output: Empty DataFrame
# Columns: []
# Index: []
DataFrame.head(n): returns the first n rows (default n=5).
For negative values of n, this function returns all rows except the last |n| rows.
If n is larger than the number of rows, this function returns all rows.
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
DataFrame.tail(n): returns the last n rows (default n=5).
For negative values of n, this function returns all rows except the first |n| rows.
If n is larger than the number of rows, this function returns all rows.
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
>>> df.tail(-3)
animal
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
Column addition: In order to add a column to a Pandas DataFrame, we can declare a new
list as a column and assign it to the existing DataFrame.
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}
df = pd.DataFrame(data)
# Declare a new list (illustrative values) and assign it as a new column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
df['Address'] = address
Selecting Rows
By label: df.loc[label]: Label-based data selector. The end index is included during
slicing.
By index: df.iloc[index]: Index-based data selector. The end index is excluded during
slicing.
Merging DataFrames: df.merge() joins two DataFrames. The how parameter controls the
join type:
left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from right frame, similar to a SQL right outer
join; preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL
inner join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves
the order of the left keys.
on: label or list: Column or index level names to join on. These must be found in
both DataFrames. If on is None and not merging on indexes, this defaults to the
intersection of the columns in both DataFrames.
left_on: Column(s) from the left DataFrame to use as keys.
right_on: Column(s) from the right DataFrame to use as keys.
suffixes: Suffix to apply to overlapping column names in the left and right side.
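A minimal sketch of these join types (the key and value columns are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})

# Inner join: intersection of keys -> only 'b' and 'c' survive
print(left.merge(right, on='key', how='inner'))

# Outer join: union of keys; unmatched values become NaN
print(left.merge(right, on='key', how='outer'))
```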
Pivot Table
Pivot tables in Pandas are powerful tools for summarizing data, allowing you to aggregate and
reshape your data easily. The pivot_table() function in Pandas is similar to Excel's pivot tables.
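A minimal pivot_table() sketch, with an invented sales table for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Product': ['A', 'B', 'A', 'B'],
    'Revenue': [100, 150, 200, 250]
})

# Aggregate Revenue with Region as rows and Product as columns
table = sales.pivot_table(values='Revenue', index='Region',
                          columns='Product', aggfunc='sum')
print(table)
```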
func: Function to use for aggregating the data, or a list of functions and/or
function names, e.g. ['sum', 'mean'].
DataFrame.agg: Aggregate using one or more operations over the specified axis.
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
>>> df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
3 NaN
DataFrame.groupby: The groupby() method is used to group data based on one or more
columns, and then you can apply aggregation functions such as mean(), sum(), count(), etc.
Syntax:
df.groupby('column_name').aggregate_function()
Example:
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Employee': ['John', 'Anna', 'Mike', 'Sara', 'Paul', 'Kate'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000]
}
df = pd.DataFrame(data)
# Group by Department and calculate the mean salary for each department
grouped_df = df.groupby('Department')['Salary'].mean()
print(grouped_df)
output:
Department
Finance 70000.0
HR 55000.0
IT 75000.0
Name: Salary, dtype: float64
Multiple Aggregations: Pass a list of functions to .agg() to compute several statistics
at once.
grouped_df = df.groupby('Department')['Salary'].agg(['mean', 'sum'])
print(grouped_df)
output:
              mean     sum
Department
Finance    70000.0  140000
HR         55000.0  110000
IT         75000.0  150000
You can also group by multiple columns:
grouped_df = df.groupby(['Department', 'Employee'])['Salary'].sum()
print(grouped_df)
output:
Department  Employee
Finance     Kate        65000
            Paul        75000
HR          Anna        60000
            John        50000
IT          Mike        70000
            Sara        80000
Name: Salary, dtype: int64
The .describe() method provides a summary of statistics like count, mean, standard
deviation, min, and max for numeric columns.
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
'Salary': [50000, 60000, 70000, 80000, 75000, 65000],
'Years_Experience': [5, 7, 3, 10, 8, 6]
}
df = pd.DataFrame(data)
# Summary statistics
summary_table = df.describe()
print(summary_table)
output:
             Salary  Years_Experience
count      6.000000          6.000000
mean   66666.666667          6.500000
std    10801.234497          2.428992
min    50000.000000          3.000000
25%    61250.000000          5.250000
50%    67500.000000          6.500000
75%    73750.000000          7.750000
max    80000.000000         10.000000
You can use the groupby() function combined with aggregation methods to generate
summary tables based on categorical columns.
grouped_summary = df.groupby('Department').agg(['mean', 'sum', 'min', 'max'])
print(grouped_summary)
output:
Salary Years_Experience
mean sum min max mean sum min max
Department
Finance 70000.0 140000 65000 75000 7.0 14 6 8
HR 55000.0 110000 50000 60000 6.0 12 5 7
IT 75000.0 150000 70000 80000 6.5 13 3 10
3. Crosstab Summary: pd.crosstab() computes a frequency table of two (or more) factors.
crosstab_summary = pd.crosstab(df['Department'], df['Years_Experience'])
print(crosstab_summary)
output:
Years_Experience 3 5 6 7 8 10
Department
Finance 0 0 1 0 1 0
HR 0 1 0 1 0 0
IT 1 0 0 0 0 1
In this case, it shows the number of employees in each department with various years of experience.
Pivot tables provide more flexibility by allowing you to compute aggregated values across
multiple dimensions.
pivot_summary = df.pivot_table(values='Salary', index='Department',
                               columns='Years_Experience', aggfunc='mean')
print(pivot_summary)
output:
Years_Experience       3        5        6        7        8        10
Department
Finance              NaN      NaN  65000.0      NaN  75000.0      NaN
HR                   NaN  50000.0      NaN  60000.0      NaN      NaN
IT               70000.0      NaN      NaN      NaN      NaN  80000.0
If you need a highly customized summary, you can use .apply() to create a summary table
with user-defined functions.
def salary_range(x):
    return x.max() - x.min()
custom_summary = df.groupby('Department').Salary.apply(salary_range)
print(custom_summary)
output:
Department
Finance    10000
HR         10000
IT         10000
Name: Salary, dtype: int64