
05 Pandas DataFrames: Notes

A Pandas DataFrame is a 2-dimensional, size-mutable, and potentially heterogeneous tabular data structure in Python, used for storing and manipulating data. It is part of the Pandas library, which is widely used for data analysis in Python.
Here are the key features of a DataFrame:
1. Rows and Columns: DataFrames consist of rows and columns, similar to a table or
spreadsheet, where each column can hold data of a different type (e.g., integers, floats,
strings).
2. Indexing: DataFrames have an index for rows and columns, allowing easy access to data by
row and column labels.
3. Label-based and position-based access: You can access data both by using labels (e.g.,
column names, row index) or positions (e.g., row number, column number).
4. Data Manipulation: You can perform various operations on a DataFrame, like filtering,
grouping, merging, reshaping, and handling missing data.

Example of creating a DataFrame:


import pandas as pd

# Example data
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 27, 22],
'City': ['New York', 'Los Angeles', 'Chicago']
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

Output:
Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
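
As a quick sketch of the label-based and position-based access mentioned above (using the df just created; .loc selects by label, .iloc by integer position):

# Label-based access: row label 0, column 'Name'
print(df.loc[0, 'Name'])    # Alice

# Position-based access: first row, first column
print(df.iloc[0, 0])        # Alice

# A row label plus a list of column labels returns a Series
print(df.loc[1, ['Name', 'City']])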

Each column in a DataFrame is a Series

Suppose we are just interested in working with the data in the column Age:

df["Age"]

Output:
0    24
1    27
2    22
Name: Age, dtype: int64

When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label between square brackets [].
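
Passing a list of column labels instead of a single label returns a DataFrame rather than a Series; a small sketch using the same df:

# A list of labels selects multiple columns and returns a DataFrame
subset = df[["Name", "Age"]]
print(type(df["Age"]))   # <class 'pandas.core.series.Series'>
print(type(subset))      # <class 'pandas.core.frame.DataFrame'>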

In Pandas, there are several ways to create DataFrames depending on the data format you have.
Here are some of the most common methods:

1. From a Dictionary
You can create a DataFrame by passing a dictionary where the keys are column names and the
values are lists or arrays of data.
import pandas as pd

data = {
'Name': ['Jack', 'Bob', 'Tom'],
'Age': [24, 27, 22],
'City': ['Pune', 'Jaipur', 'Mumbai']
}

df = pd.DataFrame(data)
print(df)

2. From a List of Lists (or Tuples)


You can create a DataFrame by passing a list (or list of tuples) where each element of the list
represents a row of data.
data = [['Jack', 24, 'Pune'], ['Bob', 27, 'Jaipur'], ['Tom', 22, 'Mumbai']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

3. From a List of Dictionaries


You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row,
and the keys are the column names.
data = [{'Name': 'Jack', 'Age': 24, 'City': 'Pune'},
{'Name': 'Bob', 'Age': 27, 'City': 'Jaipur'},
{'Name': 'Tom', 'Age': 22, 'City': 'Mumbai'}]
df = pd.DataFrame(data)
print(df)

4. From a Numpy Array


If you have a NumPy array, you can pass it to pd.DataFrame(), and optionally specify column
names.
import numpy as np

data = np.array([['Jack', 24, 'Pune'], ['Bob', 27, 'Jaipur'], ['Tom', 22, 'Mumbai']])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

Note that a NumPy array holds a single data type, so the mixed values above are all stored as strings; use df['Age'].astype(int) if you need the numeric column back.

5. From CSV or Excel Files


You can create a DataFrame by reading data from an external CSV or Excel file.
# From CSV
df = pd.read_csv('file.csv')

# From Excel
df = pd.read_excel('file.xlsx')
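
Both readers accept many optional parameters. A brief sketch of some commonly used read_csv() options ('file.csv' is a placeholder path; reading .xlsx files additionally requires an engine such as openpyxl to be installed):

df = pd.read_csv('file.csv',
                 sep=',',           # column separator
                 header=0,          # row number to use as column names
                 index_col=0,       # column to use as the row index
                 na_values=['NA'])  # extra strings to treat as missing values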

6. From a Series
You can create a DataFrame from a Pandas Series. If you have a single column, you can convert it
into a DataFrame.
import pandas as pd

# Creating a series
s = pd.Series([24, 27, 22], index=['Jack', 'Bob', 'Tom'])

# Converting series to DataFrame


df = s.to_frame(name='Age')
print(df)

7. Using pd.DataFrame.from_records()
This method is useful when you have a list of records (usually dictionaries) and want to convert it
into a DataFrame.
data = [{'Name': 'Jack', 'Age': 24, 'City': 'Pune'},
{'Name': 'Bob', 'Age': 27, 'City': 'Jaipur'},
{'Name': 'Tom', 'Age': 22, 'City': 'Mumbai'}]

df = pd.DataFrame.from_records(data)
print(df)
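
from_records() also accepts a list of tuples (or a structured NumPy array) together with explicit column names; a short sketch:

# A list of tuples plus column names works the same way
records = [('Jack', 24, 'Pune'), ('Bob', 27, 'Jaipur'), ('Tom', 22, 'Mumbai')]
df = pd.DataFrame.from_records(records, columns=['Name', 'Age', 'City'])
print(df)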

8. From a Dictionary of Series


You can also create a DataFrame by passing a dictionary where each key is the column name, and
the value is a Pandas Series.
import pandas as pd

data = {
'Name': pd.Series(['Jack', 'Bob', 'Tom']),
'Age': pd.Series([24, 27, 22]),
'City': pd.Series(['Pune', 'Jaipur', 'Mumbai'])
}

df = pd.DataFrame(data)
print(df)

9. From a Dictionary with Tuple Keys (MultiIndex Columns)

You can create a DataFrame whose columns form a MultiIndex (multiple levels of column labels) by using tuples as the dictionary keys.
data = {
('Name', 'First'): ['Jack', 'Bob', 'Tom'],
('Name', 'Last'): ['Smith', 'Johnson', 'Brown'],
'Age': [24, 27, 22],
'City': ['Pune', 'Jaipur', 'Mumbai']
}

df = pd.DataFrame(data)
print(df)

10. From a JSON File


You can read a JSON file and convert it into a DataFrame:
df = pd.read_json('data.json')
print(df)

11. From a SQL Query


You can execute a SQL query on a database and load the result directly into a DataFrame.
import sqlite3

conn = sqlite3.connect('database.db')
query = "SELECT * FROM users"
df = pd.read_sql(query, conn)
conn.close()  # release the database connection
print(df)

These are some of the most common ways to create DataFrames in Pandas. Depending on the data
you have, you can choose the method that works best for you!
Here's a simple usage example of creating and working with a Pandas DataFrame using the
dictionary method from above:

Example: Creating and Manipulating a DataFrame


import pandas as pd

# Step 1: Create a DataFrame


data = {
'Name': ['Jack', 'Bob', 'Tom'],
'Age': [24, 27, 22],
'City': ['Pune', 'Jaipur', 'Mumbai']
}

# Create DataFrame from the dictionary


df = pd.DataFrame(data)

# Step 2: Display the DataFrame


print("Original DataFrame:")
print(df)

# Step 3: Access a column


print("\nAge column:")
print(df['Age'])

# Step 4: Filter rows based on a condition


print("\nPeople older than 23:")
print(df[df['Age'] > 23])
# Step 5: Add a new column
df['Country'] = ['India', 'India', 'India']

# Step 6: Display the updated DataFrame


print("\nUpdated DataFrame with Country column:")
print(df)

# Step 7: Select specific rows and columns


print("\nSelect 'Name' and 'City' for people older than 23:")
print(df.loc[df['Age'] > 23, ['Name', 'City']])

Output:
Original DataFrame:
Name Age City
0 Jack 24 Pune
1 Bob 27 Jaipur
2 Tom 22 Mumbai

Age column:
0 24
1 27
2 22
Name: Age, dtype: int64

People older than 23:


Name Age City
0 Jack 24 Pune
1 Bob 27 Jaipur

Updated DataFrame with Country column:


Name Age City Country
0 Jack 24 Pune India
1 Bob 27 Jaipur India
2 Tom 22 Mumbai India

Select 'Name' and 'City' for people older than 23:


Name City
0 Jack Pune
1 Bob Jaipur

Explanation:
1. Create a DataFrame: We create a simple DataFrame from a dictionary.
2. Display the DataFrame: Print the DataFrame to view the data.
3. Access a column: Retrieve the 'Age' column.
4. Filter rows: Filter the rows where 'Age' is greater than 23.
5. Add a new column: We add a 'Country' column to the DataFrame.
6. Select specific rows and columns: Use .loc[] to select specific rows (people older than
23) and columns ('Name' and 'City').

This is a basic demonstration of creating, manipulating, and accessing data within a Pandas
DataFrame.
Examples of Deleting Rows and Columns in Pandas DataFrame
You can delete rows and columns using the drop() method.

1. Deleting Columns
Columns can be dropped using df.drop(columns=["col_name"]) or
df.drop("col_name", axis=1).
import pandas as pd

# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# **Deleting a single column**


df = df.drop(columns=["Salary"])
print("\nAfter Deleting 'Salary' Column:\n", df)

# **Deleting multiple columns**


df = df.drop(columns=["Age"])
print("\nAfter Deleting 'Age' Column:\n", df)

Output:
Original DataFrame:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

After Deleting 'Salary' Column:


Name Age
0 Alice 25
1 Bob 30
2 Charlie 35

After Deleting 'Age' Column:


Name
0 Alice
1 Bob
2 Charlie

2. Deleting Rows
Rows are dropped using df.drop(index=[row_index]) or df.drop(row_index,
axis=0).
# Sample DataFrame
df = pd.DataFrame(data)

# **Deleting a single row by index**


df = df.drop(index=[1]) # Drops row at index 1 (Bob)
print("\nAfter Deleting Row with Index 1:\n", df)

# **Deleting multiple rows**


df = df.drop(index=[0, 2]) # Drops rows at index 0 and 2 (Alice & Charlie)
print("\nAfter Deleting Rows with Index 0 and 2:\n", df)

Output:
After Deleting Row with Index 1:
Name Age Salary
0 Alice 25 50000
2 Charlie 35 70000

After Deleting Rows with Index 0 and 2:


Empty DataFrame
Columns: [Name, Age, Salary]
Index: []

3. Deleting Rows Based on a Condition


You can drop rows based on a condition using boolean indexing.
df = pd.DataFrame(data)

# **Delete rows where Age is greater than 30**


df = df[df["Age"] <= 30]
print("\nAfter Deleting Rows Where Age > 30:\n", df)

Output:
After Deleting Rows Where Age > 30:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
Renaming Row and Column Labels in a Pandas DataFrame
You can rename columns and row labels (index) using the .rename() method in Pandas.

1. Renaming Column Labels


Use df.rename(columns={"old_col_name": "new_col_name"}).

Example: Renaming Columns


import pandas as pd

# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Renaming Columns
df = df.rename(columns={"Name": "Full Name", "Age": "Years", "Salary":
"Income"})
print("\nAfter Renaming Columns:\n", df)

Output:
Original DataFrame:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

After Renaming Columns:


Full Name Years Income
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

2. Renaming Row Labels (Index Values)


Use df.rename(index={old_index: new_index}).

Example: Renaming Index


# Renaming Row Labels
df = df.rename(index={0: "A", 1: "B", 2: "C"})
print("\nAfter Renaming Row Labels:\n", df)

Output:
After Renaming Row Labels:
Full Name Years Income
A Alice 25 50000
B Bob 30 60000
C Charlie 35 70000

3. Renaming Both Columns and Index Together


df = df.rename(columns={"Years": "Age"}, index={"A": "Student1", "B":
"Student2"})
print("\nAfter Renaming Both Columns and Index:\n", df)

Alternative: Using .columns and .index Directly


If you want to rename all column names or row labels at once:
df.columns = ["NewName1", "NewName2", "NewName3"]
df.index = ["Row1", "Row2", "Row3"]

Conclusion
• Rename Columns → df.rename(columns={"old_name": "new_name"})
• Rename Rows (Index) → df.rename(index={old_index: new_index})
• Rename Both → df.rename(columns=..., index=...)
• Change All at Once → df.columns = [...], df.index = [...]
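
All of these calls return a new DataFrame by default. rename() also accepts inplace=True to modify the existing DataFrame instead (the column names below are placeholders):

# Modify the DataFrame in place instead of returning a new copy
df.rename(columns={"old_name": "new_name"}, inplace=True)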

A DataFrame is a two-dimensional, mutable data structure in pandas, similar to an Excel spreadsheet or SQL table. It consists of rows and columns, where columns can have different data types.

Attributes of DataFrame
These attributes provide metadata about the DataFrame:
1. df.shape – Returns the dimensions of the DataFrame as (rows, columns).
2. df.size – Returns the total number of elements (rows × columns).
3. df.ndim – Returns the number of dimensions (always 2 for a DataFrame).
4. df.columns – Returns the column labels as an Index object.
5. df.index – Returns the row labels as an Index object.
6. df.dtypes – Returns the data types of each column.
7. df.values – Returns the underlying NumPy array of values.
8. df.info() – Prints metadata, including column types and non-null values.
9. df.T – Transposes the DataFrame (rows become columns and vice versa).
# Example demonstrating the use of DataFrame attributes in pandas

import pandas as pd

# Creating a sample DataFrame


data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}

df = pd.DataFrame(data)

# Display the DataFrame


print("DataFrame:")
print(df, "\n")

# Using different attributes


print("Shape of DataFrame:", df.shape) # (rows, columns)
print("Size of DataFrame:", df.size) # Total elements (rows * columns)
print("Number of Dimensions:", df.ndim) # Should always be 2 for DataFrame
print("Column Names:", df.columns) # List of column names
print("Row Index:", df.index) # Index range
print("Data Types of Columns:\n", df.dtypes) # Data types of each column
print("Underlying NumPy array:\n", df.values) # Extract values as array
print("\nDataFrame Info:")
df.info() # Prints summary information about DataFrame

# Transposing the DataFrame


print("\nTransposed DataFrame:")
print(df.T)

Uses of DataFrame
1. Data Manipulation – Adding, updating, or deleting rows/columns.
2. Data Cleaning – Handling missing values, filtering, and replacing data.
3. Data Analysis – Aggregation, grouping, and statistical analysis.
4. Data Transformation – Applying functions, pivoting, and reshaping data.
5. Data Visualization – Plotting data using matplotlib and seaborn.
6. Integration with Databases – Reading from and writing to SQL, CSV, Excel, etc.
7. Machine Learning – Preprocessing and feature engineering for models.
The pandas DataFrame methods head(), tail(), info(), and describe()
The pandas DataFrame methods head(), tail(), info(), and describe() are essential for exploring and summarizing data. Below is a detailed explanation with examples.

1. head(n) – Display First n Rows


• Usage: df.head(n)
• Default: If n is not specified, it returns the first 5 rows.
• Purpose: Quickly inspect the top portion of the DataFrame.

Example:
import pandas as pd

# Creating a sample DataFrame


data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'Salary': [50000, 60000, 70000, 80000, 90000]
}

df = pd.DataFrame(data)

# Display first 3 rows


print(df.head(3))

Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

Useful for checking column names, data types, and first few values.

2. tail(n) – Display Last n Rows


• Usage: df.tail(n)
• Default: If n is not specified, it returns the last 5 rows.
• Purpose: Useful for inspecting the end of the dataset.

Example:
print(df.tail(2)) # Display last 2 rows
Output:
Name Age Salary
3 David 40 80000
4 Emma 45 90000

Helps verify the last few records in the dataset.

3. info() – Summary of DataFrame


• Usage: df.info()
• Purpose: Provides metadata about the DataFrame, including:
• Number of rows and columns
• Column names and data types
• Non-null values per column
• Memory usage

Example:
df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Salary 5 non-null int64
dtypes: int64(2), object(1)
memory usage: 248.0 bytes

Helps identify data types, missing values, and memory usage.

4. describe() – Statistical Summary


• Usage: df.describe()
• Purpose: Provides summary statistics for numerical columns, including:
• Count (number of non-null values)
• Mean (average value)
• Standard deviation
• Minimum and Maximum values
• 25th, 50th (median), and 75th percentiles

Example:
print(df.describe())
Output:
Age Salary
count 5.000000 5.000000
mean 35.000000 70000.000000
std 7.905694 15811.388301
min 25.000000 50000.000000
25% 30.000000 60000.000000
50% 35.000000 70000.000000
75% 40.000000 80000.000000
max 45.000000 90000.000000

Helps understand data distribution and identify outliers.
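
By default, describe() summarizes only the numeric columns. Non-numeric columns can be included as well; a short sketch:

# Include all columns: object columns get count/unique/top/freq statistics
print(df.describe(include='all'))

# Or summarize only the object (string) columns
print(df.describe(include='object'))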

Python code with detailed comments for analyzing employee salary data using
pandas DataFrame methods: head(), tail(), info(), and describe().
# Import pandas library
import pandas as pd

# Step 1: Create Sample Employee Data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace',
             'Henry', 'Ivy', 'Jack'],
    'Age': [25, 30, 35, 40, 28, 45, 32, 38, 29, 50],  # Employee ages
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR',
                   'Finance', 'IT'],  # Department names
    'Experience': [2, 5, 10, 12, 3, 20, 7, 9, 4, 15],  # Work experience in years
    'Salary': [50000, 70000, 85000, 95000, 55000, 120000, 75000, 88000, 62000,
               110000]  # Annual salaries in $
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Step 2: Display First Few Rows Using head()
print("\nFirst 5 Rows of the Dataset:")
print(df.head())  # By default, displays the first 5 rows

# Step 3: Display Last Few Rows Using tail()
print("\nLast 5 Rows of the Dataset:")
print(df.tail())  # By default, displays the last 5 rows

# Step 4: Get Dataset Summary Using info()
print("\nDataset Information:")
df.info()  # Shows structure, data types, and missing values

# Step 5: Get Summary Statistics Using describe()
print("\nStatistical Summary of Numerical Columns:")
print(df.describe())  # Shows statistics for the numerical columns (Age, Experience, Salary)
Working with Joining, Merging, and Concatenation in Pandas
When working with multiple DataFrames, we often need to combine them using different
techniques. Pandas provides three primary ways to achieve this:
1. Merging (merge()) – Similar to SQL joins.
2. Joining (join()) – Works with index-based joins.
3. Concatenation (concat()) – Stacks DataFrames vertically (rows) or horizontally
(columns).


1. Merging DataFrames (merge())


Merging is similar to SQL joins and is used to combine DataFrames based on a common column.

Example: Merge Using a Common Column


import pandas as pd

# Creating first DataFrame


employees = pd.DataFrame({
'Emp_ID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'IT', 'Finance', 'IT']
})

# Creating second DataFrame


salaries = pd.DataFrame({
'Emp_ID': [101, 102, 103, 105],
'Salary': [50000, 70000, 85000, 90000]
})

# Merging on 'Emp_ID' (Inner Join by default)


merged_df = pd.merge(employees, salaries, on='Emp_ID')
print(merged_df)

Output:
Emp_ID Name Department Salary
0 101 Alice HR 50000
1 102 Bob IT 70000
2 103 Charlie Finance 85000

By default, merge() performs an inner join, keeping only matching records from both
DataFrames.

Different Types of Joins in merge()


# Left Join (Keeps all employees, fills NaN for missing salaries)
left_join = pd.merge(employees, salaries, on='Emp_ID', how='left')

# Right Join (Keeps all salary records, fills NaN for missing employees)
right_join = pd.merge(employees, salaries, on='Emp_ID', how='right')

# Outer Join (Keeps all records from both tables, fills NaN where data is missing)
outer_join = pd.merge(employees, salaries, on='Emp_ID', how='outer')
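
To see where each merged row came from, merge() accepts indicator=True, which adds a _merge column with the values 'both', 'left_only', or 'right_only'; a small sketch using the frames above:

# indicator=True adds a '_merge' column showing each row's origin
outer_with_source = pd.merge(employees, salaries, on='Emp_ID',
                             how='outer', indicator=True)
print(outer_with_source)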

2. Joining DataFrames (join())


The join() method is used to merge DataFrames based on the index instead of a column.

Example: Join Using Index


# Creating first DataFrame with index
dept_df = pd.DataFrame({
'Department': ['HR', 'IT', 'Finance'],
'Manager': ['John', 'Emma', 'Michael']
}).set_index('Department')

# Creating second DataFrame with index


salary_df = pd.DataFrame({
'Department': ['HR', 'IT', 'Finance'],
'Avg_Salary': [60000, 80000, 90000]
}).set_index('Department')

# Joining DataFrames on index


joined_df = dept_df.join(salary_df)
print(joined_df)

Output:
Manager Avg_Salary
Department
HR John 60000
IT Emma 80000
Finance Michael 90000

This method is useful when working with hierarchical data or index-based tables.
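
Note that join() performs a left join by default; its how parameter accepts the same options as merge(). A brief sketch using the frames above:

# join() defaults to how='left'; other join types can be requested explicitly
inner_joined = dept_df.join(salary_df, how='inner')
outer_joined = dept_df.join(salary_df, how='outer')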
In Pandas, there are four main types of joins used to merge two DataFrames. These correspond to
SQL join operations:

1. Inner Join (Default in merge())


• Keeps only the matching rows from both DataFrames.
• If a key exists in one DataFrame but not the other, it is excluded.

Example
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [70000, 85000, 90000]})

inner_join = pd.merge(df1, df2, on='ID', how='inner')
print(inner_join)

Output
ID Name Salary
0 2 Bob 70000
1 3 Charlie 85000

• Row with ID = 1 from df1 is dropped (not in df2)


• Row with ID = 4 from df2 is dropped (not in df1)

2. Left Join
• Keeps all rows from the left DataFrame (df1) and only the matching rows from the right
DataFrame (df2).
• Unmatched rows from the right DataFrame will have NaN values.

Example
left_join = pd.merge(df1, df2, on='ID', how='left')
print(left_join)

Output
ID Name Salary
0 1 Alice NaN
1 2 Bob 70000.0
2 3 Charlie 85000.0

• Row with ID = 1 is kept from df1, but has no match in df2, so Salary = NaN
• Row with ID = 4 from df2 is dropped

3. Right Join
• Keeps all rows from the right DataFrame (df2) and only the matching rows from the left
DataFrame (df1).
• Unmatched rows from the left DataFrame will have NaN values.

Example
right_join = pd.merge(df1, df2, on='ID', how='right')
print(right_join)

Output
ID Name Salary
0 2 Bob 70000
1 3 Charlie 85000
2 4 NaN 90000

• Row with ID = 4 is kept from df2, but has no match in df1, so Name = NaN
• Row with ID = 1 from df1 is dropped
4. Outer Join
• Keeps all rows from both DataFrames.
• If a key exists in one DataFrame but not the other, the missing values are filled with NaN.

Example
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join)

Output
ID Name Salary
0 1 Alice NaN
1 2 Bob 70000
2 3 Charlie 85000
3 4 NaN 90000

• All records from both DataFrames are retained


• Missing values are filled with NaN

Comparison of Join Types

Join Type   Keeps All Left Rows?   Keeps All Right Rows?   Keeps Only Matching Rows?
Inner       No                     No                      Yes
Left        Yes                    No                      No
Right       No                     Yes                     No
Outer       Yes                    Yes                     No

3. Concatenating DataFrames (concat())


Concatenation is used to stack DataFrames vertically (rows) or horizontally (columns).

Example: Concatenating Row-wise (Vertical Stacking)


# Creating two DataFrames with the same columns
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

# Concatenating along rows (axis=0)


vertical_concat = pd.concat([df1, df2])
print(vertical_concat)

Output:
ID Name
0 1 Alice
1 2 Bob
0 3 Charlie
1 4 David

The index is not reset automatically. You can use ignore_index=True to fix it.
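
For example:

# ignore_index=True rebuilds a clean 0..n-1 index while concatenating
vertical_concat = pd.concat([df1, df2], ignore_index=True)
print(vertical_concat)

Output:
   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
3   4    David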
Example: Concatenating Column-wise (Horizontal Stacking)
# Creating DataFrames with the same number of rows
df3 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 70000]})

# Concatenating along columns (axis=1)


horizontal_concat = pd.concat([df1, df3], axis=1)
print(horizontal_concat)

Output:
ID Name ID Salary
0 1 Alice 1 50000
1 2 Bob 2 70000
Reshaping in Pandas
Reshaping in Pandas allows us to change the structure of a DataFrame, making it easier to analyze
or visualize data. The key functions for reshaping are:
1. Pivoting (pivot() and pivot_table())
2. Melting (melt())
3. Stacking and Unstacking (stack(), unstack())
4. Reshaping with wide_to_long()

1. Pivoting DataFrames
Pivoting is used to convert rows into columns based on unique values in a column.

Example: Using pivot()


import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
'Temperature': [32, 75, 30, 78]
})

# Pivoting: Converting City names into columns


pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)

Output
City Los Angeles New York
Date
2024-01-01 75 32
2024-01-02 78 30

• pivot() reshapes the data so that "City" values become column headers.

Using pivot_table()
pivot_table() is more flexible, allowing aggregation when there are duplicate rows.
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-01'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York'],
    'Temperature': [32, 75, 30, 78, 35]
})

# Pivot table with average temperature


pivot_table_df = df.pivot_table(index='Date', columns='City', values='Temperature', aggfunc='mean')
print(pivot_table_df)
• Handles duplicate values by applying an aggregation function like mean().
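
For the data above, New York appears twice on 2024-01-01 (32 and 35), so pivot_table() averages them; the expected output is:

City        Los Angeles  New York
Date
2024-01-01         75.0      33.5
2024-01-02         78.0      30.0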

2. Melting DataFrames (melt())


Melting converts a wide format DataFrame into a long format by turning multiple columns into
row values.

Example: Using melt()


df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-02'],
'New York': [32, 30],
'Los Angeles': [75, 78]
})

# Melting: Convert city columns back into rows


melted_df = df.melt(id_vars=['Date'], var_name='City', value_name='Temperature')
print(melted_df)

Output
Date City Temperature
0 2024-01-01 New York 32
1 2024-01-02 New York 30
2 2024-01-01 Los Angeles 75
3 2024-01-02 Los Angeles 78

• This is the opposite of pivot(), transforming wide-format data back into a long format.

3. Stacking and Unstacking


• stack(): Converts columns into a hierarchical row index (long format).
• unstack(): Moves the last row index to columns (wide format).

Example: Using stack() and unstack()


df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-02'],
'New York': [32, 30],
'Los Angeles': [75, 78]
}).set_index('Date')

# Stacking: Converts column headers into row index


stacked_df = df.stack()
print(stacked_df)

Output
Date
2024-01-01  New York       32
            Los Angeles    75
2024-01-02  New York       30
            Los Angeles    78
dtype: int64

• Cities become part of the row index.


To reverse this, use unstack():
unstacked_df = stacked_df.unstack()
print(unstacked_df)

Output
City Los Angeles New York
Date
2024-01-01 75 32
2024-01-02 78 30

• Restores the original wide format.

4. Reshaping with wide_to_long()


• Used for datasets with multiple columns following a pattern (e.g., "Sales_2019",
"Sales_2020").
• Converts wide-format data into long-format by stacking column groups into rows.

Example: Using wide_to_long()


df = pd.DataFrame({
'Store': ['A', 'B'],
'Sales_2019': [100, 150],
'Sales_2020': [120, 170]
})

# Reshaping with wide_to_long


long_df = pd.wide_to_long(df, stubnames='Sales', i='Store', j='Year', sep='_')
print(long_df)

Output
Sales
Store Year
A 2019 100
A 2020 120
B 2019 150
B 2020 170

• Column headers ("Sales_2019", "Sales_2020") are converted into a single 'Sales' column with a new 'Year' column.

Summary of Reshaping Methods


Method           Purpose                           Example Use Case
pivot()          Converts rows into columns        Convert date-wise sales data into a table with products as columns
pivot_table()    Pivot with aggregation            Average temperature per city per date
melt()           Converts wide format to long      Convert city-wise temperature data back into row format
stack()          Converts columns into row index   Reshape sales data into a hierarchical format
unstack()        Converts row index into columns   Convert long-format sales data back into wide format
wide_to_long()   Converts column groups into rows  Convert "Sales_2019", "Sales_2020" into a single 'Sales' column
Mapping in Pandas
Mapping in Pandas is used to modify or transform values in a DataFrame or Series based on a
function, dictionary, or another mapping technique. The key methods for mapping in Pandas are:
1. map() – Works on Pandas Series to apply a function or dictionary mapping.
2. apply() – Used for more complex transformations on Series or DataFrame.
3. applymap() – Used to apply a function to every element in a DataFrame.
4. replace() – Used to replace specific values in a DataFrame or Series.

1. Using map() for Series


The map() function is used to transform a Series using a dictionary, function, or another Series.

Example: Mapping Values Using a Dictionary


import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'ID': [1, 2, 3, 4],
'Department': ['HR', 'IT', 'Finance', 'IT']})

# Mapping Department Names to Department Codes


dept_map = {'HR': 101, 'IT': 102, 'Finance': 103}

# Applying the mapping


df['Dept_Code'] = df['Department'].map(dept_map)
print(df)

Output
ID Department Dept_Code
0 1 HR 101
1 2 IT 102
2 3 Finance 103
3 4 IT 102

• The map() method replaces each department name with its corresponding department code.

Example: Using map() with a Function


# Using a function to modify column values
df['Dept_Length'] = df['Department'].map(lambda x: len(x))
print(df)

Output
ID Department Dept_Code Dept_Length
0 1 HR 101 2
1 2 IT 102 2
2 3 Finance 103 7
3 4 IT 102 2

• This maps each department name to its length using a lambda function.
2. Using apply() for More Complex Transformations
The apply() function is more flexible than map() and can be used on both Series and
DataFrames.

Example: Using apply() on a Series


df['Dept_Upper'] = df['Department'].apply(str.upper)
print(df)

Output
ID Department Dept_Code Dept_Length Dept_Upper
0 1 HR 101 2 HR
1 2 IT 102 2 IT
2 3 Finance 103 7 FINANCE
3 4 IT 102 2 IT

• The apply() function is used to convert all department names to uppercase.

Example: Using apply() on a DataFrame


# Creating a function to format values
def format_id(x):
    return f"EMP-{x}"

df['Formatted_ID'] = df['ID'].apply(format_id)
print(df)

Output
ID Department Dept_Code Dept_Length Dept_Upper Formatted_ID
0 1 HR 101 2 HR EMP-1
1 2 IT 102 2 IT EMP-2
2 3 Finance 103 7 FINANCE EMP-3
3 4 IT 102 2 IT EMP-4

• The function format_id() is applied to each row in the "ID" column.

Using apply() on Multiple Columns


# Applying a function to multiple columns
df['Info'] = df.apply(lambda row: f"{row['Department']} - {row['Dept_Code']}",
axis=1)
print(df)

Output
ID Department Dept_Code Dept_Length Dept_Upper Formatted_ID Info
0 1 HR 101 2 HR EMP-1 HR - 101
1 2 IT 102 2 IT EMP-2 IT - 102
2 3 Finance 103 7 FINANCE EMP-3 Finance - 103
3 4 IT 102 2 IT EMP-4 IT - 102

• The apply() function combines multiple columns into a new column.


3. Using applymap() for Element-wise Operations on a DataFrame
The applymap() function is used to apply a function to every element in a DataFrame.

Example: Applying a Function to Every Element


# Creating a numeric DataFrame
df_numeric = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Applying a function to all elements


df_squared = df_numeric.applymap(lambda x: x ** 2)
print(df_squared)

Output
A B
0 1 16
1 4 25
2 9 36

• Each element is squared using applymap().
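
Note that in recent pandas versions (2.1 and later), applymap() is deprecated in favour of the element-wise DataFrame.map(); the equivalent call is:

# pandas >= 2.1: DataFrame.map() is the element-wise replacement for applymap()
df_squared = df_numeric.map(lambda x: x ** 2)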

4. Using replace() for Value Substitution


The replace() method is useful for replacing specific values in a DataFrame or Series.

Example: Replacing Values in a Series


df['Department'] = df['Department'].replace({'HR': 'Human Resources', 'IT': 'Information Tech'})
print(df)

Output
   ID        Department  Dept_Code  Dept_Length Dept_Upper Formatted_ID           Info
0   1   Human Resources        101            2         HR        EMP-1       HR - 101
1   2  Information Tech        102            2         IT        EMP-2       IT - 102
2   3           Finance        103            7    FINANCE        EMP-3  Finance - 103
3   4  Information Tech        102            2         IT        EMP-4       IT - 102

(Note that the Info column keeps the original abbreviations, since it was built before the replace() call and only the "Department" column was replaced.)

• The values in the "Department" column are replaced with their full names.
Summary of Mapping Functions in Pandas

Method       Works On             Usage
map()        Series               Map values using a dictionary or function
apply()      Series & DataFrame   Apply a function to each element or row/column
applymap()   DataFrame            Apply a function element-wise
replace()    Series & DataFrame   Replace specific values
Binning in Pandas
Binning is the process of converting continuous numerical data into discrete intervals (bins). It
helps in data grouping, frequency distribution analysis, and categorization. Pandas provides
two key functions for binning:
1. pd.cut() – Binning into equal-sized or custom bins.
2. pd.qcut() – Binning into quantiles (equal-sized groups based on data distribution).

1. Using cut() for Binning Based on Fixed Intervals


cut() is used to segment a numerical column into defined bins.

Example: Binning Age Groups


import pandas as pd

# Sample Data
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [23, 45, 37, 50, 29]})

# Define Bin Ranges and Labels


bins = [0, 18, 35, 50, 100] # Ranges: 0-18, 19-35, 36-50, 51-100
labels = ['Teen', 'Young Adult', 'Middle-Aged', 'Senior']

# Apply binning
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
print(df)

Output
Name Age Age Group
0 Alice 23 Young Adult
1 Bob 45 Middle-Aged
2 Charlie 37 Middle-Aged
3 David 50 Middle-Aged
4 Eve 29 Young Adult

• The cut() function categorizes each person into an Age Group based on the predefined
bins.

Including Bin Boundaries (right=False)


df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
print(df)

• By default, bins include the right boundary (right=True); setting right=False makes them left-inclusive instead. For the data above this changes David's label: age 50 now falls in the [50, 100) bin, so he becomes 'Senior' rather than 'Middle-Aged'.

2. Using qcut() for Binning into Equal-Sized Groups


qcut() divides data into quantiles (equal-sized bins) based on the data distribution.
Example: Binning Salaries into 4 Equal Groups
# Sample Data
df = pd.DataFrame({'Employee': ['A', 'B', 'C', 'D', 'E', 'F'],
'Salary': [30000, 50000, 70000, 90000, 110000, 130000]})

# Apply qcut (4 equal-sized bins)


df['Salary Bracket'] = pd.qcut(df['Salary'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print(df)

Output
Employee Salary Salary Bracket
0 A 30000 Low
1 B 50000 Medium
2 C 70000 Medium
3 D 90000 High
4 E 110000 High
5 F 130000 Very High

• qcut() creates bins that each contain approximately the same number of values.
• Unlike cut(), qcut() automatically determines bin edges based on data distribution.

3. Adding a New Column with Binned Data


We can use cut() or qcut() to create new categorical columns for better data analysis.

Example: Categorizing Exam Scores


df = pd.DataFrame({'Student': ['John', 'Emma', 'Lucas', 'Sophia', 'Liam'],
'Score': [55, 88, 72, 91, 45]})

# Define Bins and Labels


bins = [0, 50, 70, 85, 100]
labels = ['Fail', 'Average', 'Good', 'Excellent']

# Apply Binning
df['Performance'] = pd.cut(df['Score'], bins=bins, labels=labels)
print(df)

Output
Student Score Performance
0 John 55 Average
1 Emma 88 Excellent
2 Lucas 72 Good
3 Sophia 91 Excellent
4 Liam 45 Fail

• This categorizes students' scores into performance levels.


4. Getting Bin Intervals and Counts
Get Bin Intervals with retbins=True
bins_result = pd.cut(df['Score'], bins=bins, labels=labels, retbins=True)
print(bins_result[1]) # Display bin edges

Output
[ 0 50 70 85 100]

• Returns the actual bin edges used.

Count Number of Items in Each Bin


bin_counts = pd.cut(df['Score'], bins=bins, labels=labels).value_counts(sort=False)  # sort=False keeps the bin order
print(bin_counts)

Output
Fail 1
Average 1
Good 1
Excellent 2
Name: Score, dtype: int64

• Counts how many values fall into each category.

Summary of Binning in Pandas


Method      Description                              Use Case
pd.cut()    Splits data into fixed bins              Define custom age groups, salary ranges
pd.qcut()   Splits data into equal-sized quantiles   Divide scores, incomes, or sales into quartiles
Grouping a DataFrame in Pandas
Grouping in Pandas is done using the groupby() function, which allows you to aggregate,
transform, or filter data based on specific criteria. It is useful for summarizing data, computing
statistics, and organizing data into meaningful groups.

1. Basic groupby() Usage


The groupby() function groups data based on a column's values and applies aggregate
functions like sum(), mean(), count(), etc.

Example: Grouping Sales Data by Product


import pandas as pd

# Sample Data
data = {'Product': ['Laptop', 'Laptop', 'Tablet', 'Tablet', 'Phone', 'Phone'],
'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
'Sales': [1200, 1000, 800, 600, 1500, 1300]}

df = pd.DataFrame(data)

# Grouping by 'Product' and summing Sales


grouped_df = df.groupby('Product')['Sales'].sum()
print(grouped_df)

Output
Product
Laptop 2200
Phone 2800
Tablet 1400
Name: Sales, dtype: int64

• The groupby('Product') groups data by Product and sums the Sales for each
product.

2. Grouping by Multiple Columns


You can group by multiple columns to get more detailed insights.

Example: Grouping by Product and Region


grouped_df = df.groupby(['Product', 'Region'])['Sales'].sum()
print(grouped_df)

Output
Product Region
Laptop East 1200
West 1000
Phone East 1500
West 1300
Tablet East 800
West 600
Name: Sales, dtype: int64

• The data is grouped by Product and Region, showing sales for each region.

3. Applying Aggregate Functions (agg())


You can use multiple aggregation functions using .agg().

Example: Multiple Aggregations


grouped_df = df.groupby('Product').agg({'Sales': ['sum', 'mean', 'count']})
print(grouped_df)

Output
Sales
sum mean count
Product
Laptop 2200 1100 2
Phone 2800 1400 2
Tablet 1400 700 2

• This calculates the sum, mean, and count of sales for each product.
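
Named aggregation (available since pandas 0.25) produces flat, readable column names instead of the hierarchical columns above; a short sketch:

# Named aggregation: new_column=(source_column, function)
grouped_named = df.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    n_orders=('Sales', 'count')
)
print(grouped_named)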

4. Filtering Groups with filter()


The filter() function removes groups that do not meet a certain condition.

Example: Filter Products with Total Sales Over 2000


filtered_df = df.groupby('Product').filter(lambda x: x['Sales'].sum() > 2000)
print(filtered_df)

Output
Product Region Sales
0 Laptop East 1200
1 Laptop West 1000
4 Phone East 1500
5 Phone West 1300

• Only Laptop and Phone remain because their total sales exceed 2000.

5. Transforming Groups with transform()


The transform() function returns a Series of the same size as the original DataFrame, unlike
agg(), which reduces the DataFrame.
Example: Adding a Column for Total Sales Per Product
df['Total Sales'] = df.groupby('Product')['Sales'].transform('sum')
print(df)

Output
Product Region Sales Total Sales
0 Laptop East 1200 2200
1 Laptop West 1000 2200
2 Tablet East 800 1400
3 Tablet West 600 1400
4 Phone East 1500 2800
5 Phone West 1300 2800

• Each row now includes the total sales for its product category.

6. Grouping and Applying Custom Functions


You can apply custom functions to grouped data.

Example: Finding Maximum Sale Per Group


grouped_df = df.groupby('Product')['Sales'].apply(lambda x: x.max())
print(grouped_df)

Output
Product
Laptop 1200
Phone 1500
Tablet 800
Name: Sales, dtype: int64

• Finds the maximum Sales for each product.
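
Since the maximum is a built-in aggregation, the same result can be obtained directly (and usually faster) with:

grouped_df = df.groupby('Product')['Sales'].max()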

7. Grouping with size() to Count Entries


size() returns the number of occurrences in each group.

Example: Counting the Number of Sales Entries per Product


grouped_df = df.groupby('Product').size()
print(grouped_df)

Output
Product
Laptop 2
Phone 2
Tablet 2
dtype: int64

• Each product has 2 sales records.


8. Resetting Index After Grouping
After groupby(), the result often has a multi-level index. Use .reset_index() to convert it
back to a DataFrame.

Example: Resetting Index


grouped_df = df.groupby(['Product', 'Region'])['Sales'].sum().reset_index()
print(grouped_df)

Output
Product Region Sales
0 Laptop East 1200
1 Laptop West 1000
2 Phone East 1500
3 Phone West 1300
4 Tablet East 800
5 Tablet West 600

• The hierarchical index is removed, making the DataFrame easier to work with.

Summary of groupby() in Pandas


Method          Description                                           Example Use Case
groupby()       Groups data by column(s)                              df.groupby('Product')['Sales'].sum()
agg()           Applies multiple aggregations                         df.groupby('Product').agg({'Sales': ['sum', 'mean']})
filter()        Filters groups based on a condition                   df.groupby('Product').filter(lambda x: x['Sales'].sum() > 2000)
transform()     Returns a Series of same size as original DataFrame   df['Total Sales'] = df.groupby('Product')['Sales'].transform('sum')
apply()         Applies a custom function to each group               df.groupby('Product')['Sales'].apply(lambda x: x.max())
size()          Returns the count of occurrences                      df.groupby('Product').size()
reset_index()   Resets index after grouping                           df.groupby('Product')['Sales'].sum().reset_index()
