ADS Exp2
Date:
EXPERIMENT NO.:02
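There are three common techniques for imputation based on statistical measures: mean, median, and mode. Each replaces a missing value with the corresponding statistic of the observed values in the same column.
Common applications of imputation include: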
1. Healthcare: Fill missing patient info (e.g., age, blood pressure) for disease prediction.
2. E-commerce: Handle gaps in sales, ratings, or user data for recommendations.
3. Finance: Impute missing stock prices or credit scores for financial models.
4. Education: Replace missing test scores or attendance for performance analysis.
5. Marketing: Fill gaps in customer demographics for targeted ads.
6. Real Estate: Address missing property details for price prediction.
7. Social Media: Handle incomplete engagement data (e.g., likes, shares) for trend analysis.
8. Logistics: Fill gaps in vehicle mileage or delivery times for optimization.
9. Big Data: Clean large datasets for analytics and trend prediction.
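The three statistical techniques (mean, median, and mode) can be sketched on a single toy column. This is only an illustration under stated assumptions: the `ages` values and all variable names are made up for the example and are not the experiment's dataset, and pandas `fillna` is used here instead of scikit-learn's `SimpleImputer`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with two missing entries
ages = pd.Series([20.0, 22.0, 22.0, np.nan, 40.0, np.nan], name="age")

# Mean imputation: replace NaN with the average of the observed values
mean_filled = ages.fillna(ages.mean())

# Median imputation: replace NaN with the middle observed value
median_filled = ages.fillna(ages.median())

# Mode imputation: replace NaN with the most frequent observed value
# (mode() returns a Series because ties are possible; take the first entry)
mode_filled = ages.fillna(ages.mode().iloc[0])

# Show the original column next to the three imputed versions
comparison = pd.DataFrame({
    "original": ages,
    "mean": mean_filled,
    "median": median_filled,
    "mode": mode_filled,
})
print(comparison)
```

In this toy column the single large value 40 pulls the mean fill (26.0) above the median and mode fills (both 22.0), which is one reason the median is often preferred for skewed data.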
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
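# NOTE: the dataset must already be loaded into `df` (a pandas DataFrame);
# the loading step is not part of this listing.
# Show the first few rows of the dataset to understand its structure
print("Original Dataset:")
print(df.head())
# Split the columns by type (assumed definitions; the listing uses these names without defining them)
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
non_numeric_cols = df.select_dtypes(exclude=np.number).columns.tolist()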
# Introduce missing values randomly (10% of data will be missing for demonstration)
np.random.seed(42)
missing_mask = np.random.rand(*df.shape) < 0.1 # 10% missing data
df_missing = df.copy()
df_missing = df_missing.mask(missing_mask)
# Impute missing values for numeric columns using Mean, Median, and Mode.
# Non-numeric columns are copied over unchanged, so any missing values
# in them are NOT imputed here.
# 1. Mean Imputation
mean_imputer = SimpleImputer(strategy='mean')
df_mean_imputed = pd.DataFrame(
    mean_imputer.fit_transform(df_missing[numeric_cols]),
    columns=numeric_cols,
)
df_mean_imputed[non_numeric_cols] = df_missing[non_numeric_cols].reset_index(drop=True)
# 2. Median Imputation
median_imputer = SimpleImputer(strategy='median')
df_median_imputed = pd.DataFrame(
    median_imputer.fit_transform(df_missing[numeric_cols]),
    columns=numeric_cols,
)
df_median_imputed[non_numeric_cols] = df_missing[non_numeric_cols].reset_index(drop=True)
# 3. Mode Imputation
mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode_imputed = pd.DataFrame(
    mode_imputer.fit_transform(df_missing[numeric_cols]),
    columns=numeric_cols,
)
df_mode_imputed[non_numeric_cols] = df_missing[non_numeric_cols].reset_index(drop=True)
# Visualize the data before and after imputation, side by side
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.title("Data with Missing Values")
df_missing[numeric_cols].plot(kind='line', marker='o', ax=plt.gca())
plt.legend(loc='upper left')
plt.xlabel("Index")
plt.ylabel("Feature Values")
plt.subplot(1, 2, 2)
plt.title("Data After Imputation (Mean, Median, Mode)")
plt.plot(df_mean_imputed[numeric_cols], marker='o', label='Mean Imputation')
plt.plot(df_median_imputed[numeric_cols], marker='x', label='Median Imputation')
plt.plot(df_mode_imputed[numeric_cols], marker='^', label='Mode Imputation')
plt.legend(loc='upper left')
plt.xlabel("Index")
plt.ylabel("Imputed Feature Values")
plt.tight_layout()
plt.show()
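# Show the imputed datasets
print("\nMean Imputed Data (First 5 Rows):")
print(df_mean_imputed.head())
print("\nMedian Imputed Data (First 5 Rows):")
print(df_median_imputed.head())
print("\nMode Imputed Data (First 5 Rows):")
print(df_mode_imputed.head())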
Output:
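Conclusion: Imputation replaces missing values with statistical estimates (mean, median, or mode) so that the dataset is complete and consistent for analysis.
A complete dataset can be used directly by analysis and modelling tools, which can improve model performance across many fields.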
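One way to check the completeness claim is to count missing values before and after imputation. The sketch below is an illustration under stated assumptions: it uses a small made-up DataFrame (`toy`, with hypothetical columns `score` and `hours`) rather than the experiment's dataset, and applies `SimpleImputer` with the mean strategy as in the program.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "score": [50.0, np.nan, 70.0, 90.0],
    "hours": [1.0, 2.0, np.nan, 3.0],
})

# Count missing values per column before imputation
missing_before = toy.isna().sum()

# Mean imputation; keep the original column names and index
imputer = SimpleImputer(strategy="mean")
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Count missing values per column after imputation
missing_after = toy_imputed.isna().sum()

print("Missing before:")
print(missing_before)
print("Missing after:")
print(missing_after)
```

If any count in the second printout is non-zero, some column was not covered by the imputer and still needs handling.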