Assi2_DSBDA

This document demonstrates handling missing values and outliers in a dataset. It uses mode, mean, and median imputation to handle missing categorical and numeric values, identifies outliers using z-scores, and applies a square root transformation to make the distribution more symmetric.
Importing all the required libraries


# pandas: used for data manipulation
import pandas as pd
# numpy: provides support for large, multi-dimensional arrays and matrices,
# along with mathematical functions to operate on these arrays
import numpy as np
# SimpleImputer is used for handling missing values in a dataset
from sklearn.impute import SimpleImputer
# LabelEncoder is used for encoding categorical variables into numerical values
from sklearn.preprocessing import LabelEncoder
# zscore measures how many standard deviations a data point is from the mean;
# skew, shapiro, and probplot are used to assess the shape of a distribution
from scipy.stats import zscore, skew, shapiro, probplot

import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


data = pd.read_csv("/Users/apple/Downloads/BVCOEW/TE SEM 6/DSBDA
Practicals/academic.csv")

Checking for missing values and displaying them


print("Missing values before handling: ")
missing_values = data.isnull().sum() #isnull()-->checks for any missing value(gives as
Boolean) and sum() is used to show the sum of all the missing values

Missing values before handling:


print("Missing Values:")
print(missing_values)
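For reference, here is a minimal sketch of what isnull().sum() produces, using a hypothetical two-column frame (the column names and values are illustrative, not from academic.csv) and reusing the pandas and numpy imports above:

toy = pd.DataFrame({
    "Name": ["Asha", None, "Ravi"],    # hypothetical names, one missing
    "Fees": [5000.0, np.nan, 7200.0],  # hypothetical fees, one missing
})
print(toy.isnull().sum())  # prints: Name 1, Fees 1 (one missing value per column)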

for column in data.columns:
    print(f"\nColumn: {column}")
    print(data[column].head())  # head() returns the first 5 values of the column

Handling Categorical Columns with Mode:


Handling, i.e. imputing, the missing values with various strategies:

# SimpleImputer fills missing values with a specified strategy
# (mean, median, most_frequent)
handle_missing_values_categorical = SimpleImputer(strategy='most_frequent')  # handle strings with mode
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)
# fit_transform computes the most frequent value for each categorical column in
# data_categorical and replaces missing values with these values
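As a quick illustration of most_frequent imputation, here is a minimal sketch on a hypothetical categorical column (the 'Grade' column and its values are made up), reusing SimpleImputer imported above:

toy_cat = pd.DataFrame({"Grade": ["A", "B", np.nan, "A"]})  # hypothetical grades, one gap
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(toy_cat))  # the gap is filled with the mode, 'A'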

Handling Numeric Columns with Mode


# handle_missing_values = SimpleImputer(strategy='most_frequent')  # handle numeric with mode
data_numeric = data.select_dtypes(include='number')  # new DataFrame containing only the numeric columns
# data[data_numeric.columns] = handle_missing_values.fit_transform(data_numeric)  # for most_frequent on numeric
# fit_transform would compute the most frequent value for each numeric column
# and replace missing values with it

Handling with Mean


# handle_missing_values_numeric_mean = SimpleImputer(strategy='mean')  # handle numeric with mean
# data[data_numeric.columns] = handle_missing_values_numeric_mean.fit_transform(data_numeric)
# fit_transform would compute the mean of each numeric column and replace
# missing values with it

Handling with Median


handle_missing_values_numeric_median = SimpleImputer(strategy='median')  # handle numeric with median
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)
# fit_transform computes the median of each numeric column and replaces
# missing values with it
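To see why the median can be preferable to the mean, here is a minimal sketch on a hypothetical 'Fees' column containing one extreme value (all numbers are made up), reusing the imports above:

toy_num = pd.DataFrame({"Fees": [100.0, 110.0, 120.0, np.nan, 10000.0]})  # one extreme value
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy_num)
median_filled = SimpleImputer(strategy="median").fit_transform(toy_num)
print(mean_filled.ravel())    # gap filled with 2582.5: the mean is dragged up by the 10000
print(median_filled.ravel())  # gap filled with 115.0: the median ignores the extreme value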

Display the data after handling missing values:


print("\nData after handling missing values:")
print(data)
Now handling outliers (an outlier is an extreme value that deviates from the general pattern or distribution of the data).
A z-score measures how far a data point is from the mean of a group of data points, expressed in standard deviations.

# Calculate z-scores:
z_scores = zscore(data.select_dtypes(include='number'), axis=0)  # axis=0: compute z-scores along columns

Outliers are data points that are significantly different from the majority of the other data points in a set.
#Identify Outliers:
outliers = (z_scores > 3) | (z_scores < -3)

The mask method is used to replace values in the DataFrame based on a condition.
#Mask Outliers in the DataFrame:
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)
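Here is a minimal sketch of the masking step end to end, with made-up values (twenty points are used because with only a handful of points no z-score can exceed 3), reusing zscore, pandas, and numpy from above:

toy_fees = pd.DataFrame({"Fees": [100.0] * 19 + [1000.0]})  # 19 typical fees plus one extreme fee
toy_z = zscore(toy_fees, axis=0)                            # the extreme fee gets z of roughly 4.4
toy_outliers = (toy_z > 3) | (toy_z < -3)
print(toy_fees.mask(toy_outliers, np.nan).tail(2))          # the 1000.0 has been replaced by NaN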

Data Transformation: Using a Square Root Transformation


for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())

Calculate skewness before transformation


skew_before = data_no_outliers['Fees'].skew()
print(f"\nSkewness before transformation: {skew_before}")

Apply square root transformation to the 'Fees' column


# Apply a square root transformation to the 'Fees' column of data_no_outliers
# and store the transformed values in a new column named 'Fees_sqrt'
data_no_outliers['Fees_sqrt'] = np.sqrt(data_no_outliers['Fees'])

Calculate skewness after square root transformation


skew_after_sqrt = data_no_outliers['Fees_sqrt'].skew()
print(f"\nSkewness after square root transformation: {skew_after_sqrt}")

In this case, the skewness before the transformation was -0.5783, indicating the distribution was already slightly left-skewed. After applying the square root transformation, the skewness became more negative (-1.0555), a further shift to the left. This is expected: as the sketch above shows, a square root compresses large values and mainly reduces positive (right) skew, so applied to an already left-skewed variable it pushes the skewness further negative. Reducing skewness is one step towards a more symmetric distribution; a perfectly normal distribution is not always necessary or achievable in practice, but making the distribution more symmetric and closer to normal can be beneficial.
Displaying the transformed data:
print("\nTransformed Data: ")

Transformed Data:
# Focus on the output of Fees and Fees_sqrt: the data is transformed from a
# large number to its square root for easier interpretation and handling
print(data_no_outliers)

Plot histogram and Q-Q plot after square root transformation


# Plot histogram and Q-Q plot after square root transformation
plt.figure(figsize=(12, 6))  # create a new figure, 12 inches wide and 6 inches tall

plt.subplot(1, 2, 1)  # subplot grid with 1 row and 2 columns; select the first (left) subplot
# Histogram of the 'Fees_sqrt' column of data_no_outliers, with a kernel
# density estimate overlaid
sns.histplot(data_no_outliers['Fees_sqrt'], kde=True)
plt.title('Histogram of Square Root-transformed Fees')

plt.subplot(1, 2, 2)  # select the second (right) subplot
probplot(data_no_outliers['Fees_sqrt'], dist="norm", plot=plt)  # use probplot directly
plt.title('Q-Q Plot of Square Root-transformed Fees')
plt.show()
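shapiro was imported above but never used; it could complement the visual Q-Q check with a formal normality test. A minimal sketch, continuing with the document's data_no_outliers frame (dropna() guards against the NaNs introduced when outliers were masked):

stat, p_value = shapiro(data_no_outliers['Fees_sqrt'].dropna())
print(f"Shapiro-Wilk statistic: {stat:.4f}, p-value: {p_value:.4f}")
# A p-value above 0.05 means normality cannot be rejected at the 5% level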
