
05 Pandas DataFrames: Notes

A Pandas DataFrame is a 2-dimensional, size-mutable, and potentially heterogeneous tabular data structure in Python, used for storing and manipulating data. It is part of the Pandas library, which is widely used for data analysis in Python.
Here are the key features of a DataFrame:
1. Rows and Columns: DataFrames consist of rows and columns, similar to a table or
spreadsheet, where each column can hold data of a different type (e.g., integers, floats,
strings).
2. Indexing: DataFrames have an index for rows and columns, allowing easy access to data by
row and column labels.
3. Label-based and position-based access: You can access data both by using labels (e.g.,
column names, row index) or positions (e.g., row number, column number).
4. Data Manipulation: You can perform various operations on a DataFrame, like filtering,
grouping, merging, reshaping, and handling missing data.

Example of creating a DataFrame:


import pandas as pd

# Example data
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 27, 22],
'City': ['New York', 'Los Angeles', 'Chicago']
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

Output:
Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
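
As a quick sketch of the label-based and position-based access mentioned above (using the df just created; .loc selects by label, .iloc by integer position):

# Label-based access: row label 0, column 'Name'
print(df.loc[0, 'Name'])    # Alice

# Position-based access: first row, first column
print(df.iloc[0, 0])        # Alice

# A row label plus a list of column labels returns a Series
print(df.loc[1, ['Name', 'City']])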

Each column in a DataFrame is a Series

Suppose we are just interested in working with the data in the column Age:

df["Age"]

Output:
0    24
1    27
2    22
Name: Age, dtype: int64

When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label between square brackets [].
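
Passing a list of column labels instead of a single label returns a DataFrame rather than a Series; a small sketch using the same df:

# A list of labels selects multiple columns and returns a DataFrame
subset = df[["Name", "Age"]]
print(type(df["Age"]))   # <class 'pandas.core.series.Series'>
print(type(subset))      # <class 'pandas.core.frame.DataFrame'>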

In Pandas, there are several ways to create DataFrames depending on the data format you have.
Here are some of the most common methods:

1. From a Dictionary
You can create a DataFrame by passing a dictionary where the keys are column names and the
values are lists or arrays of data.
import pandas as pd

data = {
'Name': ['Jack', 'Bob', 'Tom'],
'Age': [24, 27, 22],
'City': ['Pune', 'Jaipur', 'Mumbai']
}

df = pd.DataFrame(data)
print(df)

2. From a List of Lists (or Tuples)


You can create a DataFrame by passing a list (or list of tuples) where each element of the list
represents a row of data.
data = [['Jack', 24, 'Pune'], ['Bob', 27, 'Jaipur'], ['Tom', 22, 'Mumbai']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

3. From a List of Dictionaries


You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row,
and the keys are the column names.
data = [{'Name': 'Jack', 'Age': 24, 'City': 'Pune'},
{'Name': 'Bob', 'Age': 27, 'City': 'Jaipur'},
{'Name': 'Tom', 'Age': 22, 'City': 'Mumbai'}]
df = pd.DataFrame(data)
print(df)

4. From a Numpy Array


If you have a NumPy array, you can pass it to pd.DataFrame(), and optionally specify column
names.
import numpy as np

data = np.array([['Jack', 24, 'Pune'], ['Bob', 27, 'Jaipur'], ['Tom', 22, 'Mumbai']])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

Note that a NumPy array holds a single data type, so the mixed values above are all stored as strings; use df['Age'].astype(int) if you need the numeric column back.

5. From CSV or Excel Files


You can create a DataFrame by reading data from an external CSV or Excel file.
# From CSV
df = pd.read_csv('file.csv')

# From Excel
df = pd.read_excel('file.xlsx')
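
Both readers accept many optional parameters. A brief sketch of some commonly used read_csv() options ('file.csv' is a placeholder path; reading .xlsx files additionally requires an engine such as openpyxl to be installed):

df = pd.read_csv('file.csv',
                 sep=',',           # column separator
                 header=0,          # row number to use as column names
                 index_col=0,       # column to use as the row index
                 na_values=['NA'])  # extra strings to treat as missing values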

6. From a Series
You can create a DataFrame from a Pandas Series. If you have a single column, you can convert it
into a DataFrame.
import pandas as pd

# Creating a series
s = pd.Series([24, 27, 22], index=['Jack', 'Bob', 'Tom'])

# Converting series to DataFrame


df = s.to_frame(name='Age')
print(df)

7. Using pd.DataFrame.from_records()
This method is useful when you have a list of records (usually dictionaries) and want to convert it
into a DataFrame.
data = [{'Name': 'Jack', 'Age': 24, 'City': 'Pune'},
{'Name': 'Bob', 'Age': 27, 'City': 'Jaipur'},
{'Name': 'Tom', 'Age': 22, 'City': 'Mumbai'}]

df = pd.DataFrame.from_records(data)
print(df)
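
from_records() also accepts a list of tuples (or a structured NumPy array) together with explicit column names; a short sketch:

# A list of tuples plus column names works the same way
records = [('Jack', 24, 'Pune'), ('Bob', 27, 'Jaipur'), ('Tom', 22, 'Mumbai')]
df = pd.DataFrame.from_records(records, columns=['Name', 'Age', 'City'])
print(df)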

8. From a Dictionary of Series


You can also create a DataFrame by passing a dictionary where each key is the column name, and
the value is a Pandas Series.
import pandas as pd

data = {
'Name': pd.Series(['Jack', 'Bob', 'Tom']),
'Age': pd.Series([24, 27, 22]),
'City': pd.Series(['Pune', 'Jaipur', 'Mumbai'])
}

df = pd.DataFrame(data)
print(df)

9. From a Dictionary with Tuple Keys (MultiIndex Columns)

You can create a DataFrame whose columns form a MultiIndex (multiple levels of column labels) by using tuples as the dictionary keys.
data = {
('Name', 'First'): ['Jack', 'Bob', 'Tom'],
('Name', 'Last'): ['Smith', 'Johnson', 'Brown'],
'Age': [24, 27, 22],
'City': ['Pune', 'Jaipur', 'Mumbai']
}

df = pd.DataFrame(data)
print(df)

10. From a JSON File


You can read a JSON file and convert it into a DataFrame:
df = pd.read_json('data.json')
print(df)

11. From a SQL Query


You can execute a SQL query on a database and load the result directly into a DataFrame.
import sqlite3

conn = sqlite3.connect('database.db')
query = "SELECT * FROM users"
df = pd.read_sql(query, conn)
conn.close()  # release the database connection
print(df)

These are some of the most common ways to create DataFrames in Pandas. Depending on the data
you have, you can choose the method that works best for you!
Here's a simple usage example of creating and working with a Pandas DataFrame using the
dictionary method from above:

Example: Creating and Manipulating a DataFrame


import pandas as pd

# Step 1: Create a DataFrame


data = {
'Name': ['Jack', 'Bob', 'Tom'],
'Age': [24, 27, 22],
'City': ['Pune', 'Jaipur', 'Mumbai']
}

# Create DataFrame from the dictionary


df = pd.DataFrame(data)

# Step 2: Display the DataFrame


print("Original DataFrame:")
print(df)

# Step 3: Access a column


print("\nAge column:")
print(df['Age'])

# Step 4: Filter rows based on a condition


print("\nPeople older than 23:")
print(df[df['Age'] > 23])
# Step 5: Add a new column
df['Country'] = ['India', 'India', 'India']

# Step 6: Display the updated DataFrame


print("\nUpdated DataFrame with Country column:")
print(df)

# Step 7: Select specific rows and columns


print("\nSelect 'Name' and 'City' for people older than 23:")
print(df.loc[df['Age'] > 23, ['Name', 'City']])

Output:
Original DataFrame:
Name Age City
0 Jack 24 Pune
1 Bob 27 Jaipur
2 Tom 22 Mumbai

Age column:
0 24
1 27
2 22
Name: Age, dtype: int64

People older than 23:


Name Age City
0 Jack 24 Pune
1 Bob 27 Jaipur

Updated DataFrame with Country column:


Name Age City Country
0 Jack 24 Pune India
1 Bob 27 Jaipur India
2 Tom 22 Mumbai India

Select 'Name' and 'City' for people older than 23:


Name City
0 Jack Pune
1 Bob Jaipur

Explanation:
1. Create a DataFrame: We create a simple DataFrame from a dictionary.
2. Display the DataFrame: Print the DataFrame to view the data.
3. Access a column: Retrieve the 'Age' column.
4. Filter rows: Filter the rows where 'Age' is greater than 23.
5. Add a new column: We add a 'Country' column to the DataFrame.
6. Select specific rows and columns: Use .loc[] to select specific rows (people older than
23) and columns ('Name' and 'City').

This is a basic demonstration of creating, manipulating, and accessing data within a Pandas
DataFrame.
Examples of Deleting Rows and Columns in Pandas DataFrame
You can delete rows and columns using the drop() method.

1. Deleting Columns
Columns can be dropped using df.drop(columns=["col_name"]) or
df.drop("col_name", axis=1).
import pandas as pd

# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# **Deleting a single column**


df = df.drop(columns=["Salary"])
print("\nAfter Deleting 'Salary' Column:\n", df)

# **Deleting multiple columns**


df = df.drop(columns=["Age"])
print("\nAfter Deleting 'Age' Column:\n", df)

Output:
Original DataFrame:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

After Deleting 'Salary' Column:


Name Age
0 Alice 25
1 Bob 30
2 Charlie 35

After Deleting 'Age' Column:


Name
0 Alice
1 Bob
2 Charlie

2. Deleting Rows
Rows are dropped using df.drop(index=[row_index]) or df.drop(row_index,
axis=0).
# Sample DataFrame
df = pd.DataFrame(data)

# **Deleting a single row by index**


df = df.drop(index=[1]) # Drops row at index 1 (Bob)
print("\nAfter Deleting Row with Index 1:\n", df)

# **Deleting multiple rows**


df = df.drop(index=[0, 2]) # Drops rows at index 0 and 2 (Alice & Charlie)
print("\nAfter Deleting Rows with Index 0 and 2:\n", df)

Output:
After Deleting Row with Index 1:
Name Age Salary
0 Alice 25 50000
2 Charlie 35 70000

After Deleting Rows with Index 0 and 2:


Empty DataFrame
Columns: [Name, Age, Salary]
Index: []

3. Deleting Rows Based on a Condition


You can drop rows based on a condition using boolean indexing.
df = pd.DataFrame(data)

# **Delete rows where Age is greater than 30**


df = df[df["Age"] <= 30]
print("\nAfter Deleting Rows Where Age > 30:\n", df)

Output:
After Deleting Rows Where Age > 30:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
Renaming Row and Column Labels in a Pandas DataFrame
You can rename columns and row labels (index) using the .rename() method in Pandas.

1. Renaming Column Labels


Use df.rename(columns={"old_col_name": "new_col_name"}).

Example: Renaming Columns


import pandas as pd

# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Renaming Columns
df = df.rename(columns={"Name": "Full Name", "Age": "Years", "Salary":
"Income"})
print("\nAfter Renaming Columns:\n", df)

Output:
Original DataFrame:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

After Renaming Columns:


Full Name Years Income
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

2. Renaming Row Labels (Index Values)


Use df.rename(index={old_index: new_index}).

Example: Renaming Index


# Renaming Row Labels
df = df.rename(index={0: "A", 1: "B", 2: "C"})
print("\nAfter Renaming Row Labels:\n", df)

Output:
After Renaming Row Labels:
Full Name Years Income
A Alice 25 50000
B Bob 30 60000
C Charlie 35 70000

3. Renaming Both Columns and Index Together


df = df.rename(columns={"Years": "Age"}, index={"A": "Student1", "B":
"Student2"})
print("\nAfter Renaming Both Columns and Index:\n", df)

Alternative: Using .columns and .index Directly


If you want to rename all column names or row labels at once:
df.columns = ["NewName1", "NewName2", "NewName3"]
df.index = ["Row1", "Row2", "Row3"]

Conclusion
• Rename Columns → df.rename(columns={"old_name": "new_name"})
• Rename Rows (Index) → df.rename(index={old_index: new_index})
• Rename Both → df.rename(columns=..., index=...)
• Change All at Once → df.columns = [...], df.index = [...]
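
All of these calls return a new DataFrame by default. rename() also accepts inplace=True to modify the existing DataFrame instead (the column names below are placeholders):

# Modify the DataFrame in place instead of returning a new copy
df.rename(columns={"old_name": "new_name"}, inplace=True)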

A DataFrame is a two-dimensional, mutable data structure in pandas, similar to an Excel spreadsheet or SQL table. It consists of rows and columns, where columns can have different data types.

Attributes of DataFrame
These attributes provide metadata about the DataFrame:
1. df.shape – Returns the dimensions of the DataFrame as (rows, columns).
2. df.size – Returns the total number of elements (rows × columns).
3. df.ndim – Returns the number of dimensions (always 2 for a DataFrame).
4. df.columns – Returns the column labels as an Index object.
5. df.index – Returns the row labels as an Index object.
6. df.dtypes – Returns the data types of each column.
7. df.values – Returns the underlying NumPy array of values.
8. df.info() – Prints metadata, including column types and non-null values.
9. df.T – Transposes the DataFrame (rows become columns and vice versa).
# Example demonstrating the use of DataFrame attributes in pandas

import pandas as pd

# Creating a sample DataFrame


data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}

df = pd.DataFrame(data)

# Display the DataFrame


print("DataFrame:")
print(df, "\n")

# Using different attributes


print("Shape of DataFrame:", df.shape) # (rows, columns)
print("Size of DataFrame:", df.size) # Total elements (rows * columns)
print("Number of Dimensions:", df.ndim) # Should always be 2 for DataFrame
print("Column Names:", df.columns) # List of column names
print("Row Index:", df.index) # Index range
print("Data Types of Columns:\n", df.dtypes) # Data types of each column
print("Underlying NumPy array:\n", df.values) # Extract values as array
print("\nDataFrame Info:")
df.info() # Prints summary information about DataFrame

# Transposing the DataFrame


print("\nTransposed DataFrame:")
print(df.T)

Uses of DataFrame
1. Data Manipulation – Adding, updating, or deleting rows/columns.
2. Data Cleaning – Handling missing values, filtering, and replacing data.
3. Data Analysis – Aggregation, grouping, and statistical analysis.
4. Data Transformation – Applying functions, pivoting, and reshaping data.
5. Data Visualization – Plotting data using matplotlib and seaborn.
6. Integration with Databases – Reading from and writing to SQL, CSV, Excel, etc.
7. Machine Learning – Preprocessing and feature engineering for models.
The pandas DataFrame methods head(), tail(), info(), and describe()
The pandas DataFrame methods head(), tail(), info(), and describe() are essential for exploring and summarizing data. Below is a detailed explanation with examples.

1. head(n) – Display First n Rows


• Usage: df.head(n)
• Default: If n is not specified, it returns the first 5 rows.
• Purpose: Quickly inspect the top portion of the DataFrame.

Example:
import pandas as pd

# Creating a sample DataFrame


data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
'Age': [25, 30, 35, 40, 45],
'Salary': [50000, 60000, 70000, 80000, 90000]
}

df = pd.DataFrame(data)

# Display first 3 rows


print(df.head(3))

Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000

Useful for checking column names, data types, and first few values.

2. tail(n) – Display Last n Rows


• Usage: df.tail(n)
• Default: If n is not specified, it returns the last 5 rows.
• Purpose: Useful for inspecting the end of the dataset.

Example:
print(df.tail(2)) # Display last 2 rows
Output:
Name Age Salary
3 David 40 80000
4 Emma 45 90000

Helps verify the last few records in the dataset.

3. info() – Summary of DataFrame


• Usage: df.info()
• Purpose: Provides metadata about the DataFrame, including:
• Number of rows and columns
• Column names and data types
• Non-null values per column
• Memory usage

Example:
df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Age 5 non-null int64
2 Salary 5 non-null int64
dtypes: int64(2), object(1)
memory usage: 248.0 bytes

Helps identify data types, missing values, and memory usage.

4. describe() – Statistical Summary


• Usage: df.describe()
• Purpose: Provides summary statistics for numerical columns, including:
• Count (number of non-null values)
• Mean (average value)
• Standard deviation
• Minimum and Maximum values
• 25th, 50th (median), and 75th percentiles

Example:
print(df.describe())
Output:
Age Salary
count 5.000000 5.000000
mean 35.000000 70000.000000
std 7.905694 15811.388301
min 25.000000 50000.000000
25% 30.000000 60000.000000
50% 35.000000 70000.000000
75% 40.000000 80000.000000
max 45.000000 90000.000000

Helps understand data distribution and identify outliers.
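
By default, describe() summarizes only the numeric columns. Non-numeric columns can be included as well; a short sketch:

# Include all columns: object columns get count/unique/top/freq statistics
print(df.describe(include='all'))

# Or summarize only the object (string) columns
print(df.describe(include='object'))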

Python code with detailed comments for analyzing employee salary data using
pandas DataFrame methods: head(), tail(), info(), and describe().
# Import pandas library
import pandas as pd

# Step 1: Create Sample Employee Data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace',
             'Henry', 'Ivy', 'Jack'],
    'Age': [25, 30, 35, 40, 28, 45, 32, 38, 29, 50],  # Employee ages
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR',
                   'Finance', 'IT'],  # Department names
    'Experience': [2, 5, 10, 12, 3, 20, 7, 9, 4, 15],  # Work experience in years
    'Salary': [50000, 70000, 85000, 95000, 55000, 120000, 75000, 88000, 62000,
               110000]  # Annual salaries in $
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Step 2: Display First Few Rows Using head()
print("\nFirst 5 Rows of the Dataset:")
print(df.head())  # By default, displays the first 5 rows

# Step 3: Display Last Few Rows Using tail()
print("\nLast 5 Rows of the Dataset:")
print(df.tail())  # By default, displays the last 5 rows

# Step 4: Get Dataset Summary Using info()
print("\nDataset Information:")
df.info()  # Shows structure, data types, and missing values

# Step 5: Get Summary Statistics Using describe()
print("\nStatistical Summary of Numerical Columns:")
print(df.describe())  # Shows statistics for the numerical columns (Age, Experience, Salary)
Working with Joining, Merging, and Concatenation in Pandas
When working with multiple DataFrames, we often need to combine them using different
techniques. Pandas provides three primary ways to achieve this:
1. Merging (merge()) – Similar to SQL joins.
2. Joining (join()) – Works with index-based joins.
3. Concatenation (concat()) – Stacks DataFrames vertically (rows) or horizontally
(columns).


1. Merging DataFrames (merge())


Merging is similar to SQL joins and is used to combine DataFrames based on a common column.

Example: Merge Using a Common Column


import pandas as pd

# Creating first DataFrame


employees = pd.DataFrame({
'Emp_ID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'IT', 'Finance', 'IT']
})

# Creating second DataFrame


salaries = pd.DataFrame({
'Emp_ID': [101, 102, 103, 105],
'Salary': [50000, 70000, 85000, 90000]
})

# Merging on 'Emp_ID' (Inner Join by default)


merged_df = pd.merge(employees, salaries, on='Emp_ID')
print(merged_df)

Output:
Emp_ID Name Department Salary
0 101 Alice HR 50000
1 102 Bob IT 70000
2 103 Charlie Finance 85000

By default, merge() performs an inner join, keeping only matching records from both
DataFrames.

Different Types of Joins in merge()


# Left Join (Keeps all employees, fills NaN for missing salaries)
left_join = pd.merge(employees, salaries, on='Emp_ID', how='left')

# Right Join (Keeps all salary records, fills NaN for missing employees)
right_join = pd.merge(employees, salaries, on='Emp_ID', how='right')

# Outer Join (Keeps all records from both tables, fills NaN where data is missing)
outer_join = pd.merge(employees, salaries, on='Emp_ID', how='outer')
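
To see where each merged row came from, merge() accepts indicator=True, which adds a _merge column with the values 'both', 'left_only', or 'right_only'; a small sketch using the frames above:

# indicator=True adds a '_merge' column showing each row's origin
outer_with_source = pd.merge(employees, salaries, on='Emp_ID',
                             how='outer', indicator=True)
print(outer_with_source)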

2. Joining DataFrames (join())


The join() method is used to merge DataFrames based on the index instead of a column.

Example: Join Using Index


# Creating first DataFrame with index
dept_df = pd.DataFrame({
'Department': ['HR', 'IT', 'Finance'],
'Manager': ['John', 'Emma', 'Michael']
}).set_index('Department')

# Creating second DataFrame with index


salary_df = pd.DataFrame({
'Department': ['HR', 'IT', 'Finance'],
'Avg_Salary': [60000, 80000, 90000]
}).set_index('Department')

# Joining DataFrames on index


joined_df = dept_df.join(salary_df)
print(joined_df)

Output:
Manager Avg_Salary
Department
HR John 60000
IT Emma 80000
Finance Michael 90000

This method is useful when working with hierarchical data or index-based tables.
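
Note that join() performs a left join by default; its how parameter accepts the same options as merge(). A brief sketch using the frames above:

# join() defaults to how='left'; other join types can be requested explicitly
inner_joined = dept_df.join(salary_df, how='inner')
outer_joined = dept_df.join(salary_df, how='outer')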
In Pandas, there are four main types of joins used to merge two DataFrames. These correspond to
SQL join operations:

1. Inner Join (Default in merge())


• Keeps only the matching rows from both DataFrames.
• If a key exists in one DataFrame but not the other, it is excluded.

Example
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [70000, 85000, 90000]})

inner_join = pd.merge(df1, df2, on='ID', how='inner')
print(inner_join)

Output
ID Name Salary
0 2 Bob 70000
1 3 Charlie 85000

• Row with ID = 1 from df1 is dropped (not in df2)


• Row with ID = 4 from df2 is dropped (not in df1)

2. Left Join
• Keeps all rows from the left DataFrame (df1) and only the matching rows from the right
DataFrame (df2).
• Unmatched rows from the right DataFrame will have NaN values.

Example
left_join = pd.merge(df1, df2, on='ID', how='left')
print(left_join)

Output
ID Name Salary
0 1 Alice NaN
1 2 Bob 70000.0
2 3 Charlie 85000.0

• Row with ID = 1 is kept from df1, but has no match in df2, so Salary = NaN
• Row with ID = 4 from df2 is dropped

3. Right Join
• Keeps all rows from the right DataFrame (df2) and only the matching rows from the left
DataFrame (df1).
• Unmatched rows from the left DataFrame will have NaN values.

Example
right_join = pd.merge(df1, df2, on='ID', how='right')
print(right_join)

Output
ID Name Salary
0 2 Bob 70000
1 3 Charlie 85000
2 4 NaN 90000

• Row with ID = 4 is kept from df2, but has no match in df1, so Name = NaN
• Row with ID = 1 from df1 is dropped
4. Outer Join
• Keeps all rows from both DataFrames.
• If a key exists in one DataFrame but not the other, the missing values are filled with NaN.

Example
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join)

Output
ID Name Salary
0 1 Alice NaN
1 2 Bob 70000
2 3 Charlie 85000
3 4 NaN 90000

• All records from both DataFrames are retained


• Missing values are filled with NaN

Comparison of Join Types

Join Type   Keeps All Left Rows?   Keeps All Right Rows?   Keeps Only Matching Rows?
Inner       No                     No                      Yes
Left        Yes                    No                      No
Right       No                     Yes                     No
Outer       Yes                    Yes                     No

3. Concatenating DataFrames (concat())


Concatenation is used to stack DataFrames vertically (rows) or horizontally (columns).

Example: Concatenating Row-wise (Vertical Stacking)


# Creating two DataFrames with the same columns
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

# Concatenating along rows (axis=0)


vertical_concat = pd.concat([df1, df2])
print(vertical_concat)

Output:
ID Name
0 1 Alice
1 2 Bob
0 3 Charlie
1 4 David

The index is not reset automatically. You can use ignore_index=True to fix it.
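
For example:

# ignore_index=True rebuilds a clean 0..n-1 index while concatenating
vertical_concat = pd.concat([df1, df2], ignore_index=True)
print(vertical_concat)

Output:
   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
3   4    David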
Example: Concatenating Column-wise (Horizontal Stacking)
# Creating DataFrames with the same number of rows
df3 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 70000]})

# Concatenating along columns (axis=1)


horizontal_concat = pd.concat([df1, df3], axis=1)
print(horizontal_concat)

Output:
ID Name ID Salary
0 1 Alice 1 50000
1 2 Bob 2 70000
Reshaping in Pandas
Reshaping in Pandas allows us to change the structure of a DataFrame, making it easier to analyze
or visualize data. The key functions for reshaping are:
1. Pivoting (pivot() and pivot_table())
2. Melting (melt())
3. Stacking and Unstacking (stack(), unstack())
4. Reshaping with wide_to_long()

1. Pivoting DataFrames
Pivoting is used to convert rows into columns based on unique values in a column.

Example: Using pivot()


import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
'Temperature': [32, 75, 30, 78]
})

# Pivoting: Converting City names into columns


pivot_df = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_df)

Output
City Los Angeles New York
Date
2024-01-01 75 32
2024-01-02 78 30

• pivot() reshapes the data so that "City" values become column headers.

Using pivot_table()
pivot_table() is more flexible, allowing aggregation when there are duplicate rows.
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-01'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York'],
    'Temperature': [32, 75, 30, 78, 35]
})

# Pivot table with average temperature


pivot_table_df = df.pivot_table(index='Date', columns='City', values='Temperature', aggfunc='mean')
print(pivot_table_df)
• Handles duplicate values by applying an aggregation function like mean().
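
For the data above, New York appears twice on 2024-01-01 (32 and 35), so pivot_table() averages them; the expected output is:

City        Los Angeles  New York
Date
2024-01-01         75.0      33.5
2024-01-02         78.0      30.0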

2. Melting DataFrames (melt())


Melting converts a wide format DataFrame into a long format by turning multiple columns into
row values.

Example: Using melt()


df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-02'],
'New York': [32, 30],
'Los Angeles': [75, 78]
})

# Melting: Convert city columns back into rows


melted_df = df.melt(id_vars=['Date'], var_name='City', value_name='Temperature')
print(melted_df)

Output
Date City Temperature
0 2024-01-01 New York 32
1 2024-01-02 New York 30
2 2024-01-01 Los Angeles 75
3 2024-01-02 Los Angeles 78

• This is the opposite of pivot(), transforming wide-format data back into a long format.

3. Stacking and Unstacking


• stack(): Converts columns into a hierarchical row index (long format).
• unstack(): Moves the last row index to columns (wide format).

Example: Using stack() and unstack()


df = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-02'],
'New York': [32, 30],
'Los Angeles': [75, 78]
}).set_index('Date')

# Stacking: Converts column headers into row index


stacked_df = df.stack()
print(stacked_df)

Output
Date
2024-01-01  New York       32
            Los Angeles    75
2024-01-02  New York       30
            Los Angeles    78
dtype: int64

• Cities become part of the row index.


To reverse this, use unstack():
unstacked_df = stacked_df.unstack()
print(unstacked_df)

Output
City Los Angeles New York
Date
2024-01-01 75 32
2024-01-02 78 30

• Restores the original wide format.

4. Reshaping with wide_to_long()


• Used for datasets with multiple columns following a pattern (e.g., "Sales_2019",
"Sales_2020").
• Converts wide-format data into long-format by stacking column groups into rows.

Example: Using wide_to_long()


df = pd.DataFrame({
'Store': ['A', 'B'],
'Sales_2019': [100, 150],
'Sales_2020': [120, 170]
})

# Reshaping with wide_to_long


long_df = pd.wide_to_long(df, stubnames='Sales', i='Store', j='Year', sep='_')
print(long_df)

Output
Sales
Store Year
A 2019 100
A 2020 120
B 2019 150
B 2020 170

• Column headers ("Sales_2019", "Sales_2020") are converted into a single 'Sales' column with a new 'Year' column.

Summary of Reshaping Methods


Method           Purpose                           Example Use Case
pivot()          Converts rows into columns        Convert date-wise sales data into a table with products as columns
pivot_table()    Pivot with aggregation            Average temperature per city per date
melt()           Converts wide format to long      Convert city-wise temperature data back into row format
stack()          Converts columns into row index   Reshape sales data into a hierarchical format
unstack()        Converts row index into columns   Convert long-format sales data back into wide format
wide_to_long()   Converts column groups into rows  Convert "Sales_2019", "Sales_2020" into a single 'Sales' column
Mapping in Pandas
Mapping in Pandas is used to modify or transform values in a DataFrame or Series based on a
function, dictionary, or another mapping technique. The key methods for mapping in Pandas are:
1. map() – Works on Pandas Series to apply a function or dictionary mapping.
2. apply() – Used for more complex transformations on Series or DataFrame.
3. applymap() – Used to apply a function to every element in a DataFrame.
4. replace() – Used to replace specific values in a DataFrame or Series.

1. Using map() for Series


The map() function is used to transform a Series using a dictionary, function, or another Series.

Example: Mapping Values Using a Dictionary


import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'ID': [1, 2, 3, 4],
'Department': ['HR', 'IT', 'Finance', 'IT']})

# Mapping Department Names to Department Codes


dept_map = {'HR': 101, 'IT': 102, 'Finance': 103}

# Applying the mapping


df['Dept_Code'] = df['Department'].map(dept_map)
print(df)

Output
ID Department Dept_Code
0 1 HR 101
1 2 IT 102
2 3 Finance 103
3 4 IT 102

• The map() method replaces each department name with its corresponding department code.

Example: Using map() with a Function


# Using a function to modify column values
df['Dept_Length'] = df['Department'].map(lambda x: len(x))
print(df)

Output
ID Department Dept_Code Dept_Length
0 1 HR 101 2
1 2 IT 102 2
2 3 Finance 103 7
3 4 IT 102 2

• This maps each department name to its length using a lambda function.
2. Using apply() for More Complex Transformations
The apply() function is more flexible than map() and can be used on both Series and
DataFrames.

Example: Using apply() on a Series


df['Dept_Upper'] = df['Department'].apply(str.upper)
print(df)

Output
ID Department Dept_Code Dept_Length Dept_Upper
0 1 HR 101 2 HR
1 2 IT 102 2 IT
2 3 Finance 103 7 FINANCE
3 4 IT 102 2 IT

• The apply() function is used to convert all department names to uppercase.

Example: Using apply() on a DataFrame


# Creating a function to format values
def format_id(x):
    return f"EMP-{x}"

df['Formatted_ID'] = df['ID'].apply(format_id)
print(df)

Output
ID Department Dept_Code Dept_Length Dept_Upper Formatted_ID
0 1 HR 101 2 HR EMP-1
1 2 IT 102 2 IT EMP-2
2 3 Finance 103 7 FINANCE EMP-3
3 4 IT 102 2 IT EMP-4

• The function format_id() is applied to each row in the "ID" column.

Using apply() on Multiple Columns


# Applying a function to multiple columns
df['Info'] = df.apply(lambda row: f"{row['Department']} - {row['Dept_Code']}",
axis=1)
print(df)

Output
ID Department Dept_Code Dept_Length Dept_Upper Formatted_ID Info
0 1 HR 101 2 HR EMP-1 HR - 101
1 2 IT 102 2 IT EMP-2 IT - 102
2 3 Finance 103 7 FINANCE EMP-3 Finance - 103
3 4 IT 102 2 IT EMP-4 IT - 102

• The apply() function combines multiple columns into a new column.


3. Using applymap() for Element-wise Operations on a DataFrame
The applymap() function is used to apply a function to every element in a DataFrame.

Example: Applying a Function to Every Element


# Creating a numeric DataFrame
df_numeric = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Applying a function to all elements


df_squared = df_numeric.applymap(lambda x: x ** 2)
print(df_squared)

Output
A B
0 1 16
1 4 25
2 9 36

• Each element is squared using applymap().
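
Note that in recent pandas versions (2.1 and later), applymap() is deprecated in favour of the element-wise DataFrame.map(); the equivalent call is:

# pandas >= 2.1: DataFrame.map() is the element-wise replacement for applymap()
df_squared = df_numeric.map(lambda x: x ** 2)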

4. Using replace() for Value Substitution


The replace() method is useful for replacing specific values in a DataFrame or Series.

Example: Replacing Values in a Series


df['Department'] = df['Department'].replace({'HR': 'Human Resources', 'IT': 'Information Tech'})
print(df)

Output
   ID        Department  Dept_Code  Dept_Length Dept_Upper Formatted_ID           Info
0   1   Human Resources        101            2         HR        EMP-1       HR - 101
1   2  Information Tech        102            2         IT        EMP-2       IT - 102
2   3           Finance        103            7    FINANCE        EMP-3  Finance - 103
3   4  Information Tech        102            2         IT        EMP-4       IT - 102

(Note that the Info column keeps the original abbreviations, since it was built before the replace() call and only the "Department" column was replaced.)

• The values in the "Department" column are replaced with their full names.
Summary of Mapping Functions in Pandas

Method       Works On             Usage
map()        Series               Map values using a dictionary or function
apply()      Series & DataFrame   Apply a function to each element or row/column
applymap()   DataFrame            Apply a function element-wise
replace()    Series & DataFrame   Replace specific values
Binning in Pandas
Binning is the process of converting continuous numerical data into discrete intervals (bins). It
helps in data grouping, frequency distribution analysis, and categorization. Pandas provides
two key functions for binning:
1. pd.cut() – Binning into equal-sized or custom bins.
2. pd.qcut() – Binning into quantiles (equal-sized groups based on data distribution).

1. Using cut() for Binning Based on Fixed Intervals


cut() is used to segment a numerical column into defined bins.

Example: Binning Age Groups


import pandas as pd

# Sample Data
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [23, 45, 37, 50, 29]})

# Define Bin Ranges and Labels


bins = [0, 18, 35, 50, 100] # Ranges: 0-18, 19-35, 36-50, 51-100
labels = ['Teen', 'Young Adult', 'Middle-Aged', 'Senior']

# Apply binning
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
print(df)

Output
Name Age Age Group
0 Alice 23 Young Adult
1 Bob 45 Middle-Aged
2 Charlie 37 Middle-Aged
3 David 50 Middle-Aged
4 Eve 29 Young Adult

• The cut() function categorizes each person into an Age Group based on the predefined
bins.

Including Bin Boundaries (right=False)


df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
print(df)

• By default, bins include the right boundary (right=True); setting right=False makes them left-inclusive instead. For the data above this changes David's label: age 50 now falls in the [50, 100) bin, so he becomes 'Senior' rather than 'Middle-Aged'.

2. Using qcut() for Binning into Equal-Sized Groups


qcut() divides data into quantiles (equal-sized bins) based on the data distribution.
Example: Binning Salaries into 4 Equal Groups
# Sample Data
df = pd.DataFrame({'Employee': ['A', 'B', 'C', 'D', 'E', 'F'],
'Salary': [30000, 50000, 70000, 90000, 110000, 130000]})

# Apply qcut (4 equal-sized bins)


df['Salary Bracket'] = pd.qcut(df['Salary'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print(df)

Output
Employee Salary Salary Bracket
0 A 30000 Low
1 B 50000 Medium
2 C 70000 Medium
3 D 90000 High
4 E 110000 High
5 F 130000 Very High

• qcut() creates bins that each contain approximately the same number of values.
• Unlike cut(), qcut() automatically determines bin edges based on data distribution.

3. Adding a New Column with Binned Data


We can use cut() or qcut() to create new categorical columns for better data analysis.

Example: Categorizing Exam Scores


df = pd.DataFrame({'Student': ['John', 'Emma', 'Lucas', 'Sophia', 'Liam'],
'Score': [55, 88, 72, 91, 45]})

# Define Bins and Labels


bins = [0, 50, 70, 85, 100]
labels = ['Fail', 'Average', 'Good', 'Excellent']

# Apply Binning
df['Performance'] = pd.cut(df['Score'], bins=bins, labels=labels)
print(df)

Output
Student Score Performance
0 John 55 Average
1 Emma 88 Excellent
2 Lucas 72 Good
3 Sophia 91 Excellent
4 Liam 45 Fail

• This categorizes students' scores into performance levels.


4. Getting Bin Intervals and Counts
Get Bin Intervals with retbins=True
bins_result = pd.cut(df['Score'], bins=bins, labels=labels, retbins=True)
print(bins_result[1]) # Display bin edges

Output
[ 0 50 70 85 100]

• Returns the actual bin edges used.

Count Number of Items in Each Bin


bin_counts = pd.cut(df['Score'], bins=bins, labels=labels).value_counts(sort=False)  # sort=False keeps the bin order
print(bin_counts)

Output
Fail 1
Average 1
Good 1
Excellent 2
Name: Score, dtype: int64

• Counts how many values fall into each category.

Summary of Binning in Pandas


Method      Description                              Use Case
pd.cut()    Splits data into fixed bins              Define custom age groups, salary ranges
pd.qcut()   Splits data into equal-sized quantiles   Divide scores, incomes, or sales into quartiles
Grouping a DataFrame in Pandas
Grouping in Pandas is done using the groupby() function, which allows you to aggregate,
transform, or filter data based on specific criteria. It is useful for summarizing data, computing
statistics, and organizing data into meaningful groups.

1. Basic groupby() Usage


The groupby() function groups data based on a column's values and applies aggregate
functions like sum(), mean(), count(), etc.

Example: Grouping Sales Data by Product


import pandas as pd

# Sample Data
data = {'Product': ['Laptop', 'Laptop', 'Tablet', 'Tablet', 'Phone', 'Phone'],
'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
'Sales': [1200, 1000, 800, 600, 1500, 1300]}

df = pd.DataFrame(data)

# Grouping by 'Product' and summing Sales


grouped_df = df.groupby('Product')['Sales'].sum()
print(grouped_df)

Output
Product
Laptop 2200
Phone 2800
Tablet 1400
Name: Sales, dtype: int64

• The groupby('Product') groups data by Product and sums the Sales for each
product.

2. Grouping by Multiple Columns


You can group by multiple columns to get more detailed insights.

Example: Grouping by Product and Region


grouped_df = df.groupby(['Product', 'Region'])['Sales'].sum()
print(grouped_df)

Output
Product Region
Laptop East 1200
West 1000
Phone East 1500
West 1300
Tablet East 800
West 600
Name: Sales, dtype: int64

• The data is grouped by Product and Region, showing sales for each region.

3. Applying Aggregate Functions (agg())


You can use multiple aggregation functions using .agg().

Example: Multiple Aggregations


grouped_df = df.groupby('Product').agg({'Sales': ['sum', 'mean', 'count']})
print(grouped_df)

Output
Sales
sum mean count
Product
Laptop 2200 1100 2
Phone 2800 1400 2
Tablet 1400 700 2

• This calculates the sum, mean, and count of sales for each product.
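
Named aggregation (available since pandas 0.25) produces flat, readable column names instead of the hierarchical columns above; a short sketch:

# Named aggregation: new_column=(source_column, function)
grouped_named = df.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    n_orders=('Sales', 'count')
)
print(grouped_named)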

4. Filtering Groups with filter()


The filter() function removes groups that do not meet a certain condition.

Example: Filter Products with Total Sales Over 2000


filtered_df = df.groupby('Product').filter(lambda x: x['Sales'].sum() > 2000)
print(filtered_df)

Output
Product Region Sales
0 Laptop East 1200
1 Laptop West 1000
4 Phone East 1500
5 Phone West 1300

• Only Laptop and Phone remain because their total sales exceed 2000.

5. Transforming Groups with transform()


The transform() function returns a Series of the same size as the original DataFrame, unlike
agg(), which reduces the DataFrame.
Example: Adding a Column for Total Sales Per Product
df['Total Sales'] = df.groupby('Product')['Sales'].transform('sum')
print(df)

Output
Product Region Sales Total Sales
0 Laptop East 1200 2200
1 Laptop West 1000 2200
2 Tablet East 800 1400
3 Tablet West 600 1400
4 Phone East 1500 2800
5 Phone West 1300 2800

• Each row now includes the total sales for its product category.

6. Grouping and Applying Custom Functions


You can apply custom functions to grouped data.

Example: Finding Maximum Sale Per Group


grouped_df = df.groupby('Product')['Sales'].apply(lambda x: x.max())
print(grouped_df)

Output
Product
Laptop 1200
Phone 1500
Tablet 800
Name: Sales, dtype: int64

• Finds the maximum Sales for each product.
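
Since the maximum is a built-in aggregation, the same result can be obtained directly (and usually faster) with:

grouped_df = df.groupby('Product')['Sales'].max()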

7. Grouping with size() to Count Entries


size() returns the number of occurrences in each group.

Example: Counting the Number of Sales Entries per Product


grouped_df = df.groupby('Product').size()
print(grouped_df)

Output
Product
Laptop 2
Phone 2
Tablet 2
dtype: int64

• Each product has 2 sales records.


8. Resetting Index After Grouping
After groupby(), the result often has a multi-level index. Use .reset_index() to convert it
back to a DataFrame.

Example: Resetting Index


grouped_df = df.groupby(['Product', 'Region'])['Sales'].sum().reset_index()
print(grouped_df)

Output
Product Region Sales
0 Laptop East 1200
1 Laptop West 1000
2 Phone East 1500
3 Phone West 1300
4 Tablet East 800
5 Tablet West 600

• The hierarchical index is removed, making the DataFrame easier to work with.

Summary of groupby() in Pandas


Method          Description                                           Example Use Case
groupby()       Groups data by column(s)                              df.groupby('Product')['Sales'].sum()
agg()           Applies multiple aggregations                         df.groupby('Product').agg({'Sales': ['sum', 'mean']})
filter()        Filters groups based on a condition                   df.groupby('Product').filter(lambda x: x['Sales'].sum() > 2000)
transform()     Returns a Series of same size as original DataFrame   df['Total Sales'] = df.groupby('Product')['Sales'].transform('sum')
apply()         Applies a custom function to each group               df.groupby('Product')['Sales'].apply(lambda x: x.max())
size()          Returns the count of occurrences                      df.groupby('Product').size()
reset_index()   Resets index after grouping                           df.groupby('Product')['Sales'].sum().reset_index()
