Data Mining - Week 4

Week 4

Handling Missing Data, Data Combining, and Aggregation in Pandas: Lecture Notes
I. Handling Missing Data: Operations on Null Values
Handling missing data is crucial to ensure data quality and the reliability of
analysis. Missing values can lead to biased results or reduce the performance of
machine learning models. Pandas offers several methods to detect, handle, and
impute missing values effectively.
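
The examples in this section refer to columns such as Age, Salary, and Gender without showing the data itself. As a minimal, self-contained sketch (the column names and values here are illustrative assumptions, not the original dataset), you could construct such a DataFrame directly instead of loading a CSV:

import pandas as pd
import numpy as np

# Hypothetical sample data; np.nan marks the missing entries.
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'Salary': [50000, 60000, np.nan, 75000],
    'Gender': ['F', 'M', np.nan, 'M']
})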

1. Identifying Missing Values

isnull(): Identifies missing values and returns a DataFrame of the same
shape with True for missing values.

import pandas as pd
df = pd.read_csv('data.csv')
missing_values = df.isnull()
print(missing_values)

sum(): Get the count of missing values for each column.

missing_count = df.isnull().sum()
print(missing_count)

any(): Check whether any value is missing in each column.

missing_any = df.isnull().any()
print(missing_any)
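
As a small addition beyond the original notes, passing axis=1 flags rows that contain at least one missing value, which is handy for inspecting them directly:

rows_with_missing = df[df.isnull().any(axis=1)]  # rows with at least one NaN
print(rows_with_missing)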

Visualization: Use heatmaps to visualize missing values.

import seaborn as sns


import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

2. Dropping Missing Values

dropna(): Removes rows or columns with missing values. This is typically
used when the amount of missing data is small and will not significantly
impact the dataset.

Drop Rows with Missing Values:

df_cleaned = df.dropna()

Drop Columns with Missing Values:

df_cleaned = df.dropna(axis=1)

Drop Rows with Missing Values in Specific Columns:

df_cleaned = df.dropna(subset=['Age', 'Salary'])

3. Imputing Missing Values

Fill with Mean, Median, or Mode: Imputing missing values is commonly
done to retain all rows in the dataset while filling the gaps.

Mean: Fill missing numerical values with the column mean.

# Assignment is preferred over inplace=True on a column selection,
# which is unreliable in recent pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].mean())

Median: Fill missing values with the median.

df['Age'] = df['Age'].fillna(df['Age'].median())

Mode: Fill missing values with the mode (most frequent value).

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

Forward Fill and Backward Fill: Useful for time series data where the
assumption is that values remain constant until a change occurs.

Forward Fill (ffill): Fill missing values using the previous row's value.

df = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas

Backward Fill (bfill): Fill missing values using the next row's value.

df = df.bfill()

Custom Imputation: Replace missing values with a specific constant or
custom value based on domain knowledge.

df['Salary'] = df['Salary'].fillna(50000)  # fill missing salaries with a constant value

II. Combining Datasets: Concat and Append

Combining datasets is often necessary when working with multiple data sources
or when you need to add new data to an existing dataset. Pandas provides
convenient methods such as concat() and append() to combine DataFrames.

1. Concatenation (concat())

Vertical Concatenation (stack DataFrames on top of each other): Useful
when you have multiple datasets with the same structure (i.e., the same
columns).

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [4, 5, 6],
                    'Name': ['David', 'Ella', 'Frank']})
df_combined = pd.concat([df1, df2], axis=0)
print(df_combined)

Horizontal Concatenation (combine columns): Useful when the datasets
share an index or when you want to add additional information.

df3 = pd.DataFrame({'Age': [25, 30, 35]})
df_combined_horiz = pd.concat([df1, df3], axis=1)
print(df_combined_horiz)

Concatenating with Keys: Add hierarchical keys to identify which original
DataFrame the rows came from. This is useful when you want to maintain
information about the source of each row.

df_concat_keys = pd.concat([df1, df2], keys=['Group1', 'Group2'])
print(df_concat_keys)

Ignoring Index: Reset the index when concatenating.

df_combined_reset = pd.concat([df1, df2], ignore_index=True)

2. Appending Datasets (append())

Appending Rows: Use append() to add rows from another DataFrame or
Series. This is similar to vertical concatenation.

df_appended = df1.append(df2, ignore_index=True)
print(df_appended)

Appending Series: Append a single Series (like a new row) to a
DataFrame.

new_row = pd.Series({'ID': 7, 'Name': 'George'})
df_appended_series = df1.append(new_row, ignore_index=True)
print(df_appended_series)

Deprecation Notice: append() was deprecated in pandas 1.4 and removed
in pandas 2.0. Use concat() instead, as sketched below.
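
A minimal sketch of the concat()-based replacements for the two append() calls above, reusing df1, df2, and new_row as defined earlier:

# Row-wise concat replaces df1.append(df2, ignore_index=True)
df_appended = pd.concat([df1, df2], ignore_index=True)

# concat expects DataFrames, so convert the Series to a one-row
# DataFrame first; to_frame().T turns it into a single row.
df_appended_series = pd.concat([df1, new_row.to_frame().T], ignore_index=True)
print(df_appended_series)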

III. Aggregation and Grouping: Groupby Functions

Grouping and aggregation are essential techniques to perform operations on
subsets of data, such as computing averages, sums, or counts. The groupby()
function in Pandas provides a powerful way to split data, apply functions, and
combine results.

1. Basic Grouping

Grouping by a Column: Use groupby() to group data by specific columns,
allowing you to aggregate data for each group.

df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped = df.groupby('Department')

2. Aggregation

Aggregating with Built-in Functions: Apply aggregation functions like
mean(), sum(), count(), etc., on grouped data to derive insights.

mean_salary = grouped['Salary'].mean()
print(mean_salary)

Custom Aggregation: Use agg() to apply multiple aggregation functions to
grouped data.

aggregated = grouped['Salary'].agg(['mean', 'sum', 'min', 'max'])
print(aggregated)

Renaming Aggregation Columns: Rename the aggregated columns for
clarity.

aggregated = grouped['Salary'].agg(mean_salary='mean',
total_salary='sum')
print(aggregated)

3. Grouping by Multiple Columns

Multi-Level Grouping: Group by more than one column to explore detailed
breakdowns, such as by department and location.

df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped_multi = df.groupby(['Department', 'Location'])['Salary'].mean()
print(grouped_multi)

4. Iterating Over Groups

Iterate over groups to process each group independently. This can be
helpful when different processing is required for each group.

for name, group in grouped:
    print(f"Department: {name}")
    print(group)

IV. Pivot Tables: Use Cases and Examples

Pivot tables are used to summarize and aggregate data in a flexible way, similar to
Excel pivot tables. They allow us to restructure data and gain insights by breaking
down numerical data into meaningful summaries.

1. Creating Pivot Tables

Basic Pivot Table: Create a pivot table using pivot_table(). You can
summarize values by specifying index, columns, and aggfunc.

df = pd.DataFrame({'Region': ['East', 'West', 'East', 'West', 'East'],
                   'Product': ['A', 'A', 'B', 'B', 'A'],
                   'Sales': [100, 150, 200, 300, 120]})
pivot = df.pivot_table(values='Sales', index='Region',
                       columns='Product', aggfunc='sum', fill_value=0)
print(pivot)

Multiple Aggregation Functions: Apply multiple aggregation functions to
summarize data in different ways.

pivot_multi_agg = df.pivot_table(values='Sales', index='Region',
                                 columns='Product', aggfunc=['sum', 'mean'],
                                 fill_value=0)
print(pivot_multi_agg)

2. Use Cases of Pivot Tables

Sales Analysis: Pivot tables are commonly used in sales analysis to
understand performance across different regions, products, or time
periods.

Example: Calculate total sales for each region and each product to
identify top-performing products and regions.

Human Resources: Analyze employee count or average salary by
department and location.

Example: Calculate the average salary per department to determine
compensation trends or analyze the distribution of employees across
locations (see the sketch after this list).

Financial Reporting: Summarize financial data by quarter, year, or product
type for reporting purposes.

Example: Calculate quarterly sales to compare seasonal performance
and track yearly growth.
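
An illustrative sketch of the Human Resources use case (the data is hypothetical, reusing the structure of the groupby example in Section III):

df_hr = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                      'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                      'Salary': [60000, 80000, 62000, 85000, 75000]})

# Average salary broken down by department (rows) and location (columns)
hr_pivot = df_hr.pivot_table(values='Salary', index='Department',
                             columns='Location', aggfunc='mean', fill_value=0)
print(hr_pivot)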

3. Adding Margins

Margins: Use margins=True to add row and column totals to pivot tables for
a comprehensive view.

pivot_with_totals = df.pivot_table(values='Sales', index='Region',
                                   columns='Product', aggfunc='sum',
                                   fill_value=0, margins=True)
print(pivot_with_totals)

Adding Custom Totals: Rename the totals row and column with the
margins_name parameter.

pivot_custom_margins = df.pivot_table(values='Sales', index='Region',
                                      columns='Product', aggfunc='sum',
                                      fill_value=0, margins=True,
                                      margins_name='Total Sales')
print(pivot_custom_margins)
