05 Pandas Data Frames
import pandas as pd
# Example data
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 27, 22],
'City': ['New York', 'Los Angeles', 'Chicago']
}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
df["Age"]
Out[4]:
0 22
1 35
2 58
In Pandas, there are several ways to create DataFrames depending on the data format you have.
Here are some of the most common methods:
1. From a Dictionary
You can create a DataFrame by passing a dictionary where the keys are column names and the
values are lists or arrays of data.
import pandas as pd
data = {
'Name': ['Jack', 'Bob', 'Tom'],
'Age': [24, 27, 22],
'City': ['Pune', 'Jaipur', 'Mumbai']
}
df = pd.DataFrame(data)
print(df)
# From Excel
df = pd.read_excel('file.xlsx')
6. From a Series
You can create a DataFrame from a Pandas Series. If you have a single column, you can convert it
into a DataFrame.
import pandas as pd
# Creating a Series
s = pd.Series([24, 27, 22], index=['Jack', 'Bob', 'Tom'])
df = s.to_frame(name='Age')   # convert the Series into a one-column DataFrame
print(df)
7. Using pd.DataFrame.from_records()
This method is useful when you have a list of records (usually dictionaries) and want to convert it
into a DataFrame.
data = [{'Name': 'Jack', 'Age': 24, 'City': 'Pune'},
{'Name': 'Bob', 'Age': 27, 'City': 'Jaipur'},
{'Name': 'Tom', 'Age': 22, 'City': 'Mumbai'}]
df = pd.DataFrame.from_records(data)
print(df)
A DataFrame can also be built from a dictionary of Series:
data = {
'Name': pd.Series(['Jack', 'Bob', 'Tom']),
'Age': pd.Series([24, 27, 22]),
'City': pd.Series(['Pune', 'Jaipur', 'Mumbai'])
}
df = pd.DataFrame(data)
print(df)
A DataFrame can also be loaded from a SQL database:
import sqlite3
conn = sqlite3.connect('database.db')
query = "SELECT * FROM users"
df = pd.read_sql(query, conn)
print(df)
These are some of the most common ways to create DataFrames in Pandas. Depending on the data
you have, you can choose the method that works best for you!
Here's a simple usage example of creating and working with a Pandas DataFrame using the
dictionary method from above:
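The code for this example appears to have been elided, so the following is a sketch reconstructed from the output and the numbered explanation below; the 'India' value for the new Country column is an assumption.
import pandas as pd

# 1. Create a DataFrame from a dictionary
data = {
    'Name': ['Jack', 'Bob', 'Tom'],
    'Age': [24, 27, 22],
    'City': ['Pune', 'Jaipur', 'Mumbai']
}
df = pd.DataFrame(data)

# 2. Display the DataFrame
print("Original DataFrame:")
print(df)

# 3. Access a column
print("\nAge column:")
print(df['Age'])

# 4. Filter rows where Age is greater than 23
older = df[df['Age'] > 23]

# 5. Add a new column (the value 'India' is illustrative)
df['Country'] = 'India'

# 6. Select specific rows and columns with .loc[]
subset = df.loc[df['Age'] > 23, ['Name', 'City']]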
Output:
Original DataFrame:
Name Age City
0 Jack 24 Pune
1 Bob 27 Jaipur
2 Tom 22 Mumbai
Age column:
0 24
1 27
2 22
Name: Age, dtype: int64
Explanation:
1. Create a DataFrame: We create a simple DataFrame from a dictionary.
2. Display the DataFrame: Print the DataFrame to view the data.
3. Access a column: Retrieve the 'Age' column.
4. Filter rows: Filter the rows where 'Age' is greater than 23.
5. Add a new column: We add a 'Country' column to the DataFrame.
6. Select specific rows and columns: Use .loc[] to select specific rows (people older than
23) and columns ('Name' and 'City').
This is a basic demonstration of creating, manipulating, and accessing data within a Pandas
DataFrame.
Examples of Deleting Rows and Columns in Pandas DataFrame
You can delete rows and columns using the drop() method.
1. Deleting Columns
Columns can be dropped using df.drop(columns=["col_name"]) or
df.drop("col_name", axis=1).
import pandas as pd
# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
Output:
Original DataFrame:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
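The column-dropping step itself is not shown above; a minimal sketch of dropping the Salary column:
df_no_salary = df.drop(columns=["Salary"])
print("After Deleting the Salary Column:\n", df_no_salary)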
2. Deleting Rows
Rows are dropped using df.drop(index=[row_index]) or df.drop(row_index,
axis=0).
# Sample DataFrame
df = pd.DataFrame(data)
df_dropped = df.drop(index=[1])   # drop the row with index label 1
print("After Deleting Row with Index 1:\n", df_dropped)
Output:
After Deleting Row with Index 1:
Name Age Salary
0 Alice 25 50000
2 Charlie 35 70000
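Rows can also be removed by a condition. A sketch that keeps only rows where Age is at most 30, matching the output below:
df_filtered = df[df["Age"] <= 30]
print("After Deleting Rows Where Age > 30:\n", df_filtered)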
Output:
After Deleting Rows Where Age > 30:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
Renaming Row and Column Labels in a Pandas DataFrame
You can rename columns and row labels (index) using the .rename() method in Pandas.
# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Renaming Columns
df = df.rename(columns={"Name": "Full Name", "Age": "Years", "Salary": "Income"})
print("\nAfter Renaming Columns:\n", df)
Output:
Original DataFrame:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
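The renamed-columns output and the index-renaming step appear to be elided here. A sketch of renaming the row labels, using the A/B/C labels taken from the output below:
df = df.rename(index={0: "A", 1: "B", 2: "C"})
print("\nAfter Renaming Row Labels:\n", df)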
Output:
After Renaming Row Labels:
Full Name Years Income
A Alice 25 50000
B Bob 30 60000
C Charlie 35 70000
Conclusion
• Rename Columns → df.rename(columns={"old_name": "new_name"})
• Rename Rows (Index) → df.rename(index={old_index: new_index})
• Rename Both → df.rename(columns=..., index=...)
• Change All at Once → df.columns = [...], df.index = [...]
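A minimal sketch of the "change all at once" approach; the new labels here are illustrative:
df.columns = ["Full Name", "Years", "Income"]   # replace all column labels at once
df.index = ["A", "B", "C"]                      # replace all row labels at once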
Attributes of DataFrame
These attributes provide metadata about the DataFrame:
1. df.shape – Returns the dimensions of the DataFrame as (rows, columns).
2. df.size – Returns the total number of elements (rows × columns).
3. df.ndim – Returns the number of dimensions (always 2 for a DataFrame).
4. df.columns – Returns the column labels as an Index object.
5. df.index – Returns the row labels as an Index object.
6. df.dtypes – Returns the data types of each column.
7. df.values – Returns the underlying NumPy array of values.
8. df.info() – Prints metadata, including column types and non-null values.
9. df.T – Transposes the DataFrame (rows become columns and vice versa).
# Example demonstrating the use of DataFrame attributes in pandas
import pandas as pd
# Sample data (same values as the earlier examples)
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35], "Salary": [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df.shape)    # (3, 3)
print(df.dtypes)   # data type of each column
print(df.columns)  # Index(['Name', 'Age', 'Salary'], dtype='object')
Uses of DataFrame
1. Data Manipulation – Adding, updating, or deleting rows/columns.
2. Data Cleaning – Handling missing values, filtering, and replacing data.
3. Data Analysis – Aggregation, grouping, and statistical analysis.
4. Data Transformation – Applying functions, pivoting, and reshaping data.
5. Data Visualization – Plotting data using matplotlib and seaborn.
6. Integration with Databases – Reading from and writing to SQL, CSV, Excel, etc.
7. Machine Learning – Preprocessing and feature engineering for models.
The pandas DataFrame methods head(), tail(), info(), and describe()
The pandas DataFrame methods head(), tail(), info(), and describe() are essential for
exploring and summarizing data. Below is a detailed explanation with examples.
Example:
import pandas as pd
# 'data' is assumed to be the five-employee dataset shown in the outputs below
df = pd.DataFrame(data)
print(df.head(3))  # display the first 3 rows
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
Useful for checking column names, data types, and first few values.
Example:
print(df.tail(2)) # Display last 2 rows
Output:
Name Age Salary
3 David 40 80000
4 Emma 45 90000
Example:
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Salary 5 non-null int64
dtypes: int64(2), object(1)
memory usage: 248.0 bytes
Example:
print(df.describe())
Output:
Age Salary
count 5.000000 5.000000
mean 35.000000 70000.000000
std 7.905694 15811.388301
min 25.000000 50000.000000
25% 30.000000 60000.000000
50% 35.000000 70000.000000
75% 40.000000 80000.000000
max 45.000000 90000.000000
Python code with detailed comments for analyzing employee salary data using
pandas DataFrame methods: head(), tail(), info(), and describe().
# Import pandas library
import pandas as pd
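The rest of this code block appears to have been elided. Below is a sketch of a dataset and the four calls consistent with the outputs shown above; the values are taken directly from those outputs.
# Create a DataFrame with employee salary data
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 45],
    "Salary": [50000, 60000, 70000, 80000, 90000]
}
df = pd.DataFrame(data)

print(df.head(3))     # first 3 rows: a quick look at the data
print(df.tail(2))     # last 2 rows
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for the numeric columns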
Joining, Merging, and Concatenation in Pandas
Here is an explanation of how to work with joining, merging, and concatenation in Pandas.
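The code that produces the next output appears to be missing; a sketch of two DataFrames sharing an 'Emp_ID' key and an inner merge, with values taken from that output:
employees = pd.DataFrame({'Emp_ID': [101, 102, 103],
                          'Name': ['Alice', 'Bob', 'Charlie'],
                          'Department': ['HR', 'IT', 'Finance']})
salaries = pd.DataFrame({'Emp_ID': [101, 102, 103],
                         'Salary': [50000, 70000, 85000]})
# Merge on the common key column (inner join by default)
merged = pd.merge(employees, salaries, on='Emp_ID')
print(merged)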
Output:
Emp_ID Name Department Salary
0 101 Alice HR 50000
1 102 Bob IT 70000
2 103 Charlie Finance 85000
By default, merge() performs an inner join, keeping only matching records from both
DataFrames.
# Right Join (keeps all salary records, fills NaN for missing employees)
right_join = pd.merge(employees, salaries, on='Emp_ID', how='right')
# Outer Join (keeps all records from both tables, fills NaN where data is missing)
outer_join = pd.merge(employees, salaries, on='Emp_ID', how='outer')
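The output below comes from an index-based join() whose code is not shown; a sketch that would produce a similar result, with the manager and salary values taken from that output:
# Two DataFrames indexed by Department
managers = pd.DataFrame({'Manager': ['John', 'Emma', 'Michael']},
                        index=['HR', 'IT', 'Finance'])
avg_salaries = pd.DataFrame({'Avg_Salary': [60000, 80000, 90000]},
                            index=['HR', 'IT', 'Finance'])
managers.index.name = 'Department'
avg_salaries.index.name = 'Department'
# join() aligns the two DataFrames on their index by default
joined = managers.join(avg_salaries)
print(joined)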
Output:
Manager Avg_Salary
Department
HR John 60000
IT Emma 80000
Finance Michael 90000
This method is useful when working with hierarchical data or index-based tables.
In Pandas, there are four main types of joins used to merge two DataFrames. These correspond to
SQL join operations:
1. Inner Join
• Keeps only the rows whose key exists in both DataFrames; all other rows are dropped.
Example
import pandas as pd
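The DataFrames for these join examples appear to have been elided; a sketch of df1 and df2 with values inferred from the outputs below, followed by the inner join:
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [70000, 85000, 90000]})
inner_join = pd.merge(df1, df2, on='ID', how='inner')
print(inner_join)   # only ID 2 and 3 appear in both DataFrames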
2. Left Join
• Keeps all rows from the left DataFrame (df1) and only the matching rows from the right
DataFrame (df2).
• Unmatched rows from the right DataFrame will have NaN values.
Example
left_join = pd.merge(df1, df2, on='ID', how='left')
print(left_join)
Output
ID Name Salary
0 1 Alice NaN
1 2 Bob 70000.0
2 3 Charlie 85000.0
• Row with ID = 1 is kept from df1, but has no match in df2, so Salary = NaN
• Row with ID = 4 from df2 is dropped
3. Right Join
• Keeps all rows from the right DataFrame (df2) and only the matching rows from the left
DataFrame (df1).
• Unmatched rows from the left DataFrame will have NaN values.
Example
right_join = pd.merge(df1, df2, on='ID', how='right')
print(right_join)
Output
ID Name Salary
0 2 Bob 70000
1 3 Charlie 85000
2 4 NaN 90000
• Row with ID = 4 is kept from df2, but has no match in df1, so Name = NaN
• Row with ID = 1 from df1 is dropped
4. Outer Join
• Keeps all rows from both DataFrames.
• If a key exists in one DataFrame but not the other, the missing values are filled with NaN.
Example
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join)
Output
ID Name Salary
0 1 Alice NaN
1 2 Bob 70000.0
2 3 Charlie 85000.0
3 4 NaN 90000.0
Join Type   Keeps All Left Rows?   Keeps All Right Rows?   Keeps Only Matching Rows?
Inner       No                     No                      Yes
Left        Yes                    No                      No
Right       No                     Yes                     No
Outer       Yes                    Yes                     No
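Concatenation with pd.concat() stacks DataFrames on top of each other (row-wise) or side by side (column-wise). A sketch of the row-wise case that matches the output below; the names are taken from that output:
# Row-wise concatenation (vertical stacking)
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})
result = pd.concat([df1, df2])   # the original indexes 0 and 1 are kept unless ignore_index=True
print(result)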
Output:
ID Name
0 1 Alice
1 2 Bob
0 3 Charlie
1 4 David
The index is not reset automatically. You can use ignore_index=True to fix it.
Example: Concatenating Column-wise (Horizontal Stacking)
# Creating a DataFrame with the same number of rows as df1 above
df3 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 70000]})
result = pd.concat([df1, df3], axis=1)   # stack the columns side by side
print(result)
Output:
ID Name ID Salary
0 1 Alice 1 50000
1 2 Bob 2 70000
Reshaping in Pandas
Reshaping in Pandas allows us to change the structure of a DataFrame, making it easier to analyze
or visualize data. The key functions for reshaping are:
1. Pivoting (pivot() and pivot_table())
2. Melting (melt())
3. Stacking and Unstacking (stack(), unstack())
4. Reshaping with wide_to_long()
1. Pivoting DataFrames
Pivoting is used to convert rows into columns based on unique values in a column.
# Sample DataFrame
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Temperature': [32, 75, 30, 78]
})
pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)
Output
City Los Angeles New York
Date
2024-01-01 75 32
2024-01-02 78 30
• pivot() reshapes the data so that "City" values become column headers.
Using pivot_table()
pivot_table() is more flexible, allowing aggregation when there are duplicate rows.
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-01'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York'],
    'Temperature': [32, 75, 30, 78, 35]
})
# Average the duplicate readings (two values for New York on 2024-01-01)
pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature', aggfunc='mean')
print(pivot_df)
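The pivot_table() output appears to have been omitted here; the output that follows is a long-format table, which the note below attributes to melt(). A sketch of a melt() call that produces an equivalent result; the wide-format starting frame is an assumption:
# 2. Melting: wide format back to long format
wide_df = pd.DataFrame({'Date': ['2024-01-01', '2024-01-02'],
                        'New York': [32, 30],
                        'Los Angeles': [75, 78]})
melted = wide_df.melt(id_vars='Date', var_name='City', value_name='Temperature')
print(melted)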
Output
Date City Temperature
0 2024-01-01 New York 32
1 2024-01-02 New York 30
2 2024-01-01 Los Angeles 75
3 2024-01-02 Los Angeles 78
• This is the opposite of pivot(), transforming wide-format data back into a long format.
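The next two outputs show stacking and unstacking; the code is not shown, so here is a sketch using the pivoted table from above (the ordering of the inner level may differ slightly):
# 3. Stacking and unstacking
stacked = pivot_df.stack()       # move the City columns into an inner row index -> a Series
print(stacked)
unstacked = stacked.unstack()    # move the inner row index back out into columns
print(unstacked)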
Output
Date
2024-01-01 New York 32
Los Angeles 75
2024-01-02 New York 30
Los Angeles 78
dtype: int64
Output
City Los Angeles New York
Date
2024-01-01 75 32
2024-01-02 78 30
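The next output comes from wide_to_long(); its code is not shown, so here is a sketch with values taken from that output:
# 4. wide_to_long(): collapse Sales_2019 / Sales_2020 into a single 'Sales' column
sales_wide = pd.DataFrame({'Store': ['A', 'B'],
                           'Sales_2019': [100, 150],
                           'Sales_2020': [120, 170]})
long_df = pd.wide_to_long(sales_wide, stubnames='Sales', i='Store', j='Year', sep='_')
print(long_df.sort_index())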
Output
Sales
Store Year
A 2019 100
A 2020 120
B 2019 150
B 2020 170
Function        Purpose                               Example use
melt()          Converts wide format to long format   Convert city-wise temperature data back into row format
stack()         Converts columns into a row index     Reshape sales data into a hierarchical format
unstack()       Converts a row index into columns     Convert long-format sales data back into wide format
wide_to_long()  Converts column groups into rows      Convert "Sales_2019", "Sales_2020" into a single 'Sales' column
Mapping in Pandas
Mapping in Pandas is used to modify or transform values in a DataFrame or Series based on a
function, dictionary, or another mapping technique. The key methods for mapping in Pandas are:
1. map() – Works on Pandas Series to apply a function or dictionary mapping.
2. apply() – Used for more complex transformations on Series or DataFrame.
3. applymap() – Used to apply a function to every element in a DataFrame.
4. replace() – Used to replace specific values in a DataFrame or Series.
# Creating a DataFrame
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Department': ['HR', 'IT', 'Finance', 'IT']})
dept_codes = {'HR': 101, 'IT': 102, 'Finance': 103}   # department -> code
df['Dept_Code'] = df['Department'].map(dept_codes)
print(df)
Output
ID Department Dept_Code
0 1 HR 101
1 2 IT 102
2 3 Finance 103
3 4 IT 102
• The map() method replaces each department name with its corresponding department code.
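The step producing the next output is not shown; as described in the note after it, map() with a lambda gives the length of each department name:
df['Dept_Length'] = df['Department'].map(lambda x: len(x))
print(df)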
Output
ID Department Dept_Code Dept_Length
0 1 HR 101 2
1 2 IT 102 2
2 3 Finance 103 7
3 4 IT 102 2
• This maps each department name to its length using a lambda function.
2. Using apply() for More Complex Transformations
The apply() function is more flexible than map() and can be used on both Series and
DataFrames.
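The code for the next output is not shown; a sketch that upper-cases each department name with apply():
df['Dept_Upper'] = df['Department'].apply(lambda x: x.upper())
print(df)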
Output
ID Department Dept_Code Dept_Length Dept_Upper
0 1 HR 101 2 HR
1 2 IT 102 2 IT
2 3 Finance 103 7 FINANCE
3 4 IT 102 2 IT
# Define a helper that formats the employee ID, then apply it to each value
def format_id(emp_id):
    return f"EMP-{emp_id}"

df['Formatted_ID'] = df['ID'].apply(format_id)
print(df)
Output
ID Department Dept_Code Dept_Length Dept_Upper Formatted_ID
0 1 HR 101 2 HR EMP-1
1 2 IT 102 2 IT EMP-2
2 3 Finance 103 7 FINANCE EMP-3
3 4 IT 102 2 IT EMP-4
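The next output adds an Info column combining two existing columns; a sketch using a row-wise apply() (axis=1 is an assumption consistent with the result):
df['Info'] = df.apply(lambda row: f"{row['Department']} - {row['Dept_Code']}", axis=1)
print(df)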
Output
ID Department Dept_Code Dept_Length Dept_Upper Formatted_ID Info
0 1 HR 101 2 HR EMP-1 HR - 101
1 2 IT 102 2 IT EMP-2 IT - 102
2 3 Finance 103 7 FINANCE EMP-3 Finance - 103
3 4 IT 102 2 IT EMP-4 IT - 102
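The next output shows every element of a small numeric DataFrame squared; its code is not shown. A sketch using applymap(), with the input values taken from the output:
# 3. applymap(): apply a function to every element of a DataFrame
df_num = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df_num.applymap(lambda x: x ** 2))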
Output
A B
0 1 16
1 4 25
2 9 36
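The replace() step is not shown; a sketch that substitutes the short department names with full names. Note that the Info column in the output below also shows the full names, so it is assumed to have been rebuilt after the replacement:
# 4. replace(): substitute specific values in the Department column
df = df.replace({'Department': {'HR': 'Human Resources', 'IT': 'Information Tech'}})
df['Info'] = df.apply(lambda row: f"{row['Department']} - {row['Dept_Code']}", axis=1)  # rebuild Info (assumption)
print(df)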
Output
ID Department Dept_Code Dept_Length Dept_Upper Formatted_ID Info
0 1 Human Resources 101 2 HR EMP-1 Human Resources - 101
1 2 Information Tech 102 2 IT EMP-2 Information Tech - 102
2 3 Finance 103 7 FINANCE EMP-3 Finance - 103
3 4 Information Tech 102 2 IT EMP-4 Information Tech - 102
• The values in the "Department" column are replaced with their full names.
Summary of Mapping Functions in Pandas
Method Works On Usage
map() Series Map values using a dictionary or function
apply() Series & DataFrame Apply a function to each element or row/column
applymap() DataFrame Apply a function element-wise
replace() Series & DataFrame Replace specific values
Binning in Pandas
Binning is the process of converting continuous numerical data into discrete intervals (bins). It
helps in data grouping, frequency distribution analysis, and categorization. Pandas provides
two key functions for binning:
1. pd.cut() – Binning into equal-sized or custom bins.
2. pd.qcut() – Binning into quantiles (equal-sized groups based on data distribution).
# Sample Data
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [23, 45, 37, 50, 29]})
# Bin edges and labels (assumed values consistent with the output below)
bins = [18, 30, 60]
labels = ['Young Adult', 'Middle-Aged']
# Apply binning
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
print(df)
Output
Name Age Age Group
0 Alice 23 Young Adult
1 Bob 45 Middle-Aged
2 Charlie 37 Middle-Aged
3 David 50 Middle-Aged
4 Eve 29 Young Adult
• The cut() function categorizes each person into an Age Group based on the predefined
bins.
• By default, bins include the right boundary (right=True), but setting right=False
makes them left-inclusive.
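The qcut() example's code appears to be elided; a sketch with salary values taken from the output below. With only six values the quartile boundaries are interpolated, so a few bracket assignments may differ slightly from those shown:
df_sal = pd.DataFrame({'Employee': ['A', 'B', 'C', 'D', 'E', 'F'],
                       'Salary': [30000, 50000, 70000, 90000, 110000, 130000]})
# Four quantile-based bins with readable labels
df_sal['Salary Bracket'] = pd.qcut(df_sal['Salary'], q=4,
                                   labels=['Low', 'Medium', 'High', 'Very High'])
print(df_sal)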
Output
Employee Salary Salary Bracket
0 A 30000 Low
1 B 50000 Medium
2 C 70000 Medium
3 D 90000 High
4 E 110000 High
5 F 130000 Very High
• qcut() creates bins that each contain approximately the same number of values.
• Unlike cut(), qcut() automatically determines bin edges based on data distribution.
# Sample Data
df = pd.DataFrame({'Student': ['John', 'Emma', 'Lucas', 'Sophia', 'Liam'],
                   'Score': [55, 88, 72, 91, 45]})
bins = [0, 50, 70, 85, 100]                       # edges shown in the output further below
labels = ['Fail', 'Average', 'Good', 'Excellent']
# Apply Binning
df['Performance'] = pd.cut(df['Score'], bins=bins, labels=labels)
print(df)
Output
Student Score Performance
0 John 55 Average
1 Emma 88 Excellent
2 Lucas 72 Good
3 Sophia 91 Excellent
4 Liam 45 Fail
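The code for the next two outputs is missing; a sketch that inspects the bin edges and counts the students per band (the exact Series name in the last output may vary between pandas versions):
binned, edges = pd.cut(df['Score'], bins=bins, labels=labels, retbins=True)
print(edges.astype(int))                # the bin edges that were used
print(binned.value_counts(sort=False))  # number of students per performance band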
Output
[ 0 50 70 85 100]
Output
Fail 1
Average 1
Good 1
Excellent 2
Name: Score, dtype: int64
GroupBy in Pandas
# Sample Data
data = {'Product': ['Laptop', 'Laptop', 'Tablet', 'Tablet', 'Phone', 'Phone'],
        'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
        'Sales': [1200, 1000, 800, 600, 1500, 1300]}
df = pd.DataFrame(data)
# Group by Product and total the Sales for each group
print(df.groupby('Product')['Sales'].sum())
Output
Product
Laptop 2200
Phone 2800
Tablet 1400
Name: Sales, dtype: int64
• The groupby('Product') groups data by Product and sums the Sales for each
product.
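Grouping by two keys gives one row per Product/Region pair; a sketch producing the next output:
print(df.groupby(['Product', 'Region'])['Sales'].sum())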
Output
Product Region
Laptop East 1200
West 1000
Phone East 1500
West 1300
Tablet East 800
West 600
Name: Sales, dtype: int64
• The data is grouped by Product and Region, showing sales for each region.
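Several aggregations can be applied at once with agg(); a sketch matching the next output:
print(df.groupby('Product').agg({'Sales': ['sum', 'mean', 'count']}))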
Output
Sales
sum mean count
Product
Laptop 2200 1100 2
Phone 2800 1400 2
Tablet 1400 700 2
• This calculates the sum, mean, and count of sales for each product.
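filter() keeps only the groups that satisfy a condition; a sketch consistent with the note after the next output:
high_sellers = df.groupby('Product').filter(lambda g: g['Sales'].sum() > 2000)
print(high_sellers)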
Output
Product Region Sales
0 Laptop East 1200
1 Laptop West 1000
4 Phone East 1500
5 Phone West 1300
• Only Laptop and Phone remain because their total sales exceed 2000.
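transform('sum') broadcasts each product's total back onto its rows; a sketch producing the next output:
df['Total Sales'] = df.groupby('Product')['Sales'].transform('sum')
print(df)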
Output
Product Region Sales Total Sales
0 Laptop East 1200 2200
1 Laptop West 1000 2200
2 Tablet East 800 1400
3 Tablet West 600 1400
4 Phone East 1500 2800
5 Phone West 1300 2800
• Each row now includes the total sales for its product category.
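The code for the next output is missing; the values shown are consistent with taking either the maximum or the first sale per product, so this sketch uses max() as one possibility:
print(df.groupby('Product')['Sales'].max())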
Output
Product
Laptop 1200
Phone 1500
Tablet 800
Name: Sales, dtype: int64
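The next output counts the number of rows in each product group; a sketch using size():
print(df.groupby('Product').size())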
Output
Product
Laptop 2
Phone 2
Tablet 2
dtype: int64
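The last output comes from resetting the hierarchical index after a grouped sum; a sketch:
grouped = df.groupby(['Product', 'Region'])['Sales'].sum().reset_index()
print(grouped)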
Output
Product Region Sales
0 Laptop East 1200
1 Laptop West 1000
2 Phone East 1500
3 Phone West 1300
4 Tablet East 800
5 Tablet West 600
• The hierarchical index is removed, making the DataFrame easier to work with.