Data Mining - Week 4

Week 4

Handling Missing Data, Data Combining, and Aggregation in Pandas: Lecture Notes
I. Handling Missing Data: Operations on Null Values
Handling missing data is crucial to ensure data quality and the reliability of
analysis. Missing values can lead to biased results or reduce the performance of
machine learning models. Pandas offers several methods to detect, handle, and
impute missing values effectively.
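
The examples in this section refer to columns such as Age, Salary, and Gender without showing the data itself. As a minimal, self-contained sketch (the column names and values here are illustrative assumptions, not the original dataset), you could construct such a DataFrame directly instead of loading a CSV:

import pandas as pd
import numpy as np

# Hypothetical sample data; np.nan marks the missing entries.
df = pd.DataFrame({
    'Age': [25, np.nan, 35, 40],
    'Salary': [50000, 60000, np.nan, 75000],
    'Gender': ['F', 'M', np.nan, 'M']
})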

1. Identifying Missing Values

isnull(): Identifies missing values and returns a DataFrame of the same
shape with True for missing values.

import pandas as pd
df = pd.read_csv('data.csv')
missing_values = df.isnull()
print(missing_values)

sum(): Get the count of missing values for each column.

missing_count = df.isnull().sum()
print(missing_count)

any(): Check whether any value is missing in each column.

missing_any = df.isnull().any()
print(missing_any)
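
As a small addition beyond the original notes, passing axis=1 flags rows that contain at least one missing value, which is handy for inspecting them directly:

rows_with_missing = df[df.isnull().any(axis=1)]  # rows with at least one NaN
print(rows_with_missing)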

Visualization: Use heatmaps to visualize missing values.

import seaborn as sns


import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

2. Dropping Missing Values

dropna(): Removes rows or columns with missing values. This is typically
used when the amount of missing data is small and will not significantly
impact the dataset.

Drop Rows with Missing Values:

df_cleaned = df.dropna()

Drop Columns with Missing Values:

df_cleaned = df.dropna(axis=1)

Drop Rows with Missing Values in Specific Columns:

df_cleaned = df.dropna(subset=['Age', 'Salary'])

3. Imputing Missing Values

Fill with Mean, Median, or Mode: Imputing missing values is commonly
done to retain all rows in the dataset while filling the gaps.

Mean: Fill missing numerical values with the column mean.

# Assignment is preferred over inplace=True on a column selection,
# which is unreliable in recent pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].mean())

Median: Fill missing values with the median.

df['Age'] = df['Age'].fillna(df['Age'].median())

Mode: Fill missing values with the mode (most frequent value).

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

Forward Fill and Backward Fill: Useful for time series data where the
assumption is that values remain constant until a change occurs.

Forward Fill (ffill): Fill missing values using the previous row's value.

df = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas

Backward Fill (bfill): Fill missing values using the next row's value.

df = df.bfill()

Custom Imputation: Replace missing values with a specific constant or
custom value based on domain knowledge.

df['Salary'] = df['Salary'].fillna(50000)  # fill missing salaries with a constant value

II. Combining Datasets: Concat and Append

Combining datasets is often necessary when working with multiple data sources
or when you need to add new data to an existing dataset. Pandas provides
convenient methods such as concat() and append() to combine DataFrames.

1. Concatenation (concat())

Vertical Concatenation (stack DataFrames on top of each other): Useful
when you have multiple datasets with the same structure (i.e., the same
columns).

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [4, 5, 6],
                    'Name': ['David', 'Ella', 'Frank']})
df_combined = pd.concat([df1, df2], axis=0)
print(df_combined)

Horizontal Concatenation (combine columns): Useful when the datasets
share an index or when you want to add additional information.

df3 = pd.DataFrame({'Age': [25, 30, 35]})
df_combined_horiz = pd.concat([df1, df3], axis=1)
print(df_combined_horiz)

Concatenating with Keys: Add hierarchical keys to identify which original
DataFrame the rows came from. This is useful when you want to maintain
information about the source of each row.

df_concat_keys = pd.concat([df1, df2], keys=['Group1', 'Group2'])
print(df_concat_keys)

Ignoring Index: Reset the index when concatenating.

df_combined_reset = pd.concat([df1, df2], ignore_index=True)

2. Appending Datasets (append())

Appending Rows: Use append() to add rows from another DataFrame or
Series. This is similar to vertical concatenation.

df_appended = df1.append(df2, ignore_index=True)
print(df_appended)

Appending Series: Append a single Series (like a new row) to a
DataFrame.

new_row = pd.Series({'ID': 7, 'Name': 'George'})
df_appended_series = df1.append(new_row, ignore_index=True)
print(df_appended_series)

Deprecation Notice: append() was deprecated in pandas 1.4 and removed
in pandas 2.0. Use concat() instead, as sketched below.
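
A minimal sketch of the concat()-based replacements for the two append() calls above, reusing df1, df2, and new_row as defined earlier:

# Row-wise concat replaces df1.append(df2, ignore_index=True)
df_appended = pd.concat([df1, df2], ignore_index=True)

# concat expects DataFrames, so convert the Series to a one-row
# DataFrame first; to_frame().T turns it into a single row.
df_appended_series = pd.concat([df1, new_row.to_frame().T], ignore_index=True)
print(df_appended_series)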

III. Aggregation and Grouping: Groupby Functions

Grouping and aggregation are essential techniques to perform operations on
subsets of data, such as computing averages, sums, or counts. The groupby()
function in Pandas provides a powerful way to split data, apply functions, and
combine results.

1. Basic Grouping

Grouping by a Column: Use groupby() to group data by specific columns,
allowing you to aggregate data for each group.

df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped = df.groupby('Department')

2. Aggregation

Aggregating with Built-in Functions: Apply aggregation functions like
mean(), sum(), count(), etc., on grouped data to derive insights.

mean_salary = grouped['Salary'].mean()
print(mean_salary)

Custom Aggregation: Use agg() to apply multiple aggregation functions to
grouped data.

aggregated = grouped['Salary'].agg(['mean', 'sum', 'min', 'max'])
print(aggregated)

Renaming Aggregation Columns: Rename the aggregated columns for
clarity.

aggregated = grouped['Salary'].agg(mean_salary='mean',
total_salary='sum')
print(aggregated)

3. Grouping by Multiple Columns

Multi-Level Grouping: Group by more than one column to explore detailed
breakdowns, such as by department and location.

df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                   'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                   'Salary': [60000, 80000, 62000, 85000, 75000]})
grouped_multi = df.groupby(['Department', 'Location'])['Salary'].mean()
print(grouped_multi)

4. Iterating Over Groups

Iterate over groups to process each group independently. This can be
helpful when different processing is required for each group.

for name, group in grouped:
    print(f"Department: {name}")
    print(group)

IV. Pivot Tables: Use Cases and Examples

Pivot tables are used to summarize and aggregate data in a flexible way, similar to
Excel pivot tables. They allow us to restructure data and gain insights by breaking
down numerical data into meaningful summaries.

1. Creating Pivot Tables

Basic Pivot Table: Create a pivot table using pivot_table(). You can
summarize values by specifying index, columns, and aggfunc.

df = pd.DataFrame({'Region': ['East', 'West', 'East', 'West', 'East'],
                   'Product': ['A', 'A', 'B', 'B', 'A'],
                   'Sales': [100, 150, 200, 300, 120]})
pivot = df.pivot_table(values='Sales', index='Region',
                       columns='Product', aggfunc='sum', fill_value=0)
print(pivot)

Multiple Aggregation Functions: Apply multiple aggregation functions to
summarize data in different ways.

pivot_multi_agg = df.pivot_table(values='Sales', index='Region',
                                 columns='Product', aggfunc=['sum', 'mean'],
                                 fill_value=0)
print(pivot_multi_agg)

2. Use Cases of Pivot Tables

Sales Analysis: Pivot tables are commonly used in sales analysis to
understand performance across different regions, products, or time
periods.

Example: Calculate total sales for each region and each product to
identify top-performing products and regions.

Human Resources: Analyze employee count or average salary by
department and location.

Example: Calculate the average salary per department to determine
compensation trends or analyze the distribution of employees across
locations (see the sketch after this list).

Financial Reporting: Summarize financial data by quarter, year, or product
type for reporting purposes.

Example: Calculate quarterly sales to compare seasonal performance
and track yearly growth.
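
An illustrative sketch of the Human Resources use case (the data is hypothetical, reusing the structure of the groupby example in Section III):

df_hr = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
                      'Location': ['NY', 'SF', 'NY', 'SF', 'LA'],
                      'Salary': [60000, 80000, 62000, 85000, 75000]})

# Average salary broken down by department (rows) and location (columns)
hr_pivot = df_hr.pivot_table(values='Salary', index='Department',
                             columns='Location', aggfunc='mean', fill_value=0)
print(hr_pivot)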

3. Adding Margins

Margins: Use margins=True to add row and column totals to pivot tables for
a comprehensive view.

pivot_with_totals = df.pivot_table(values='Sales', index='Region',
                                   columns='Product', aggfunc='sum',
                                   fill_value=0, margins=True)
print(pivot_with_totals)

Adding Custom Totals: Rename the totals row and column with the
margins_name parameter.

pivot_custom_margins = df.pivot_table(values='Sales', index='Region',
                                      columns='Product', aggfunc='sum',
                                      fill_value=0, margins=True,
                                      margins_name='Total Sales')
print(pivot_custom_margins)
