
Roll no:

Date:
EXPERIMENT NO.: 02

Aim: Data Preprocessing using Data Imputation.


Theory:
Data preprocessing is an important step in data science that transforms raw data into a clean,
structured format for analysis. It involves tasks such as handling missing values, normalizing data, and
encoding variables. Mastering preprocessing in Python ensures reliable insights, accurate predictions,
and effective decision-making. Preprocessing refers to the transformations applied to data before it is
fed to the algorithm.
Data imputation is the process of replacing missing or incomplete data in a dataset with
substituted values so that the dataset remains useful for analysis or modeling. Missing data can occur
due to errors during data collection, system failures, or other reasons, and handling it appropriately is
crucial to maintaining the integrity of data analysis.
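
In practice, the first step is usually to check how much data is missing before choosing an imputation method. The snippet below is a minimal sketch using pandas; the toy DataFrame and its column names are hypothetical and not taken from the experiment's dataset.

import pandas as pd
import numpy as np

# Hypothetical toy data with gaps, only to illustrate inspecting missingness
df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["Pune", "Mumbai", None, "Delhi"]})

print(df.isnull().sum())          # number of missing values per column
print(df.isnull().mean() * 100)   # percentage of missing values per column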

There are three common techniques for imputation based on statistical measures:

1. Mean Imputation:
Missing values are replaced with the average of the non-missing values in the column.
○ Best for numerical data without outliers.
○ Example: For a column [2, 4, NaN, 6], mean = (2+4+6)/3 = 4, so replace NaN with 4.
2. Median Imputation:
Missing values are replaced with the median of the non-missing values.
○ Suitable for numerical data with outliers, as the median is less affected by extreme values.
○ Example: For a column [1, 2, NaN, 100], median = 2, so replace NaN with 2.
3. Mode Imputation:
Missing values are replaced with the most frequently occurring value (mode) in the column.
○ Works well for categorical data or numerical data with repeated values.
○ Example: For a column [A, B, NaN, A, C], mode = A, so replace NaN with A.
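
These three techniques can be illustrated with a minimal pandas sketch. The example columns below simply mirror the examples listed above; fillna() is used here for brevity, while the full program that follows uses scikit-learn's SimpleImputer.

import pandas as pd

num_no_outlier = pd.Series([2, 4, None, 6])           # mean example
num_with_outlier = pd.Series([1, 2, None, 100])       # median example
categorical = pd.Series(["A", "B", None, "A", "C"])   # mode example

# Mean imputation: NaN -> average of non-missing values (4.0)
mean_filled = num_no_outlier.fillna(num_no_outlier.mean())

# Median imputation: NaN -> middle value, robust to the outlier 100 (2.0)
median_filled = num_with_outlier.fillna(num_with_outlier.median())

# Mode imputation: NaN -> most frequent value ("A")
mode_filled = categorical.fillna(categorical.mode()[0])

print(mean_filled.tolist())    # [2.0, 4.0, 4.0, 6.0]
print(median_filled.tolist())  # [1.0, 2.0, 2.0, 100.0]
print(mode_filled.tolist())    # ['A', 'B', 'A', 'A', 'C']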

Applications of Data Imputation:

1.​ Healthcare: Fill missing patient info (e.g., age, blood pressure) for disease prediction.
2.​ E-commerce: Handle gaps in sales, ratings, or user data for recommendations.
3.​ Finance: Impute missing stock prices or credit scores for financial models.
4.​ Education: Replace missing test scores or attendance for performance analysis.
5.​ Marketing: Fill gaps in customer demographics for targeted ads.
6.​ Real Estate: Address missing property details for price prediction.
7.​ Social Media: Handle incomplete engagement data (e.g., likes, shares) for trend analysis.
8.​ Logistics: Fill gaps in vehicle mileage or delivery times for optimization.
9.​ Big Data: Clean large datasets for analytics and trend prediction.
Program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

# Load the dataset from the specified path
file_path = r"C:\Users\RGIT\Desktop\A806\Diabetes Missing Data.csv"
df = pd.read_csv(file_path)

# Show the first few rows of the dataset to understand its structure
print("Original Dataset:")
print(df.head())

# Introduce missing values randomly (10% of data will be missing for demonstration)
np.random.seed(42)
missing_mask = np.random.rand(*df.shape) < 0.1 # 10% missing data
df_missing = df.copy()
df_missing = df_missing.mask(missing_mask)

# Show the data with missing values
print("\nData with Missing Values:")
print(df_missing.head())

# Separate numeric and non-numeric columns
numeric_cols = df_missing.select_dtypes(include=[np.number]).columns
non_numeric_cols = df_missing.select_dtypes(exclude=[np.number]).columns

# Impute missing values for numeric columns using Mean, Median, and Mode
# 1. Mean Imputation
mean_imputer = SimpleImputer(strategy='mean')
df_mean_imputed = pd.DataFrame(mean_imputer.fit_transform(df_missing[numeric_cols]),
                               columns=numeric_cols)
df_mean_imputed[non_numeric_cols] = df_missing[non_numeric_cols].reset_index(drop=True)

# 2. Median Imputation
median_imputer = SimpleImputer(strategy='median')
df_median_imputed = pd.DataFrame(median_imputer.fit_transform(df_missing[numeric_cols]),
                                 columns=numeric_cols)
df_median_imputed[non_numeric_cols] = df_missing[non_numeric_cols].reset_index(drop=True)

# 3. Mode Imputation
mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode_imputed = pd.DataFrame(mode_imputer.fit_transform(df_missing[numeric_cols]),
                               columns=numeric_cols)
df_mode_imputed[non_numeric_cols] = df_missing[non_numeric_cols].reset_index(drop=True)

# Visualize the data after imputation
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.title("Data with Missing Values")
df_missing[numeric_cols].plot(kind='line', marker='o', ax=plt.gca())
plt.legend(loc='upper left')
plt.xlabel("Index")
plt.ylabel("Feature Values")

plt.subplot(1, 2, 2)
plt.title("Data After Imputation (Mean, Median, Mode)")
plt.plot(df_mean_imputed[numeric_cols], marker='o', label='Mean Imputation')
plt.plot(df_median_imputed[numeric_cols], marker='x', label='Median Imputation')
plt.plot(df_mode_imputed[numeric_cols], marker='^', label='Mode Imputation')
plt.legend(loc='upper left')
plt.xlabel("Index")
plt.ylabel("Imputed Feature Values")
plt.tight_layout()
plt.show()
# Show the imputed datasets
print("\nMean Imputed Data (First 5 Rows):")
print(df_mean_imputed.head())

print("\nMedian Imputed Data (First 5 Rows):")


print(df_median_imputed.head())

print("\nMode Imputed Data (First 5 Rows):")


print(df_mode_imputed.head())

Output:
Conclusion: Imputation fills in missing data, ensuring completeness and consistency, enabling better
analysis, and improving model performance across various fields.
