Assi2_DSBDA

This document demonstrates handling missing values and outliers in a dataset. It uses mode, mean, and median imputation to handle missing categorical and numeric values, identifies outliers using z-scores, and applies a square root transformation to make the distribution more symmetric.
Importing all the required libraries


# pandas: used for data manipulation
import pandas as pd
# numpy: provides support for large, multi-dimensional arrays and matrices,
# along with mathematical functions to operate on these arrays
import numpy as np
# SimpleImputer is used for handling missing values in a dataset
from sklearn.impute import SimpleImputer
# LabelEncoder is used for encoding categorical variables into numerical values
from sklearn.preprocessing import LabelEncoder
# zscore measures how many standard deviations a data point is from the mean;
# skew, shapiro, and probplot are used to assess the shape of a distribution
from scipy.stats import zscore, skew, shapiro, probplot

import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


data = pd.read_csv("/Users/apple/Downloads/BVCOEW/TE SEM 6/DSBDA
Practicals/academic.csv")

Checking for missing values and displaying them


print("Missing values before handling: ")
missing_values = data.isnull().sum() #isnull()-->checks for any missing value(gives as
Boolean) and sum() is used to show the sum of all the missing values

Missing values before handling:


print("Missing Values:")
print(missing_values)
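For reference, here is a minimal sketch of what isnull().sum() produces, using a hypothetical two-column frame (the column names and values are illustrative, not from academic.csv) and reusing the pandas and numpy imports above:

toy = pd.DataFrame({
    "Name": ["Asha", None, "Ravi"],    # hypothetical names, one missing
    "Fees": [5000.0, np.nan, 7200.0],  # hypothetical fees, one missing
})
print(toy.isnull().sum())  # prints: Name 1, Fees 1 (one missing value per column)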

for column in data.columns:
    print(f"\nColumn: {column}")
    print(data[column].head())  # head() returns the first 5 values of the column

Handling Categorical Columns with Mode:


Handling, i.e. imputing, the missing values with various strategies:

# SimpleImputer fills missing values with a specified strategy
# (mean, median, most_frequent)
handle_missing_values_categorical = SimpleImputer(strategy='most_frequent')  # handle strings with mode
data_categorical = data.select_dtypes(exclude='number')
data[data_categorical.columns] = handle_missing_values_categorical.fit_transform(data_categorical)
# fit_transform computes the most frequent value for each categorical column in
# data_categorical and replaces missing values with these values
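As a quick illustration of most_frequent imputation, here is a minimal sketch on a hypothetical categorical column (the 'Grade' column and its values are made up), reusing SimpleImputer imported above:

toy_cat = pd.DataFrame({"Grade": ["A", "B", np.nan, "A"]})  # hypothetical grades, one gap
imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(toy_cat))  # the gap is filled with the mode, 'A'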

Handling Numeric Columns with Mode


# handle_missing_values = SimpleImputer(strategy='most_frequent')  # handle numeric with mode
data_numeric = data.select_dtypes(include='number')  # new DataFrame containing only the numeric columns
# data[data_numeric.columns] = handle_missing_values.fit_transform(data_numeric)  # for most_frequent on numeric
# fit_transform would compute the most frequent value for each numeric column
# and replace missing values with it

Handling with Mean


# handle_missing_values_numeric_mean = SimpleImputer(strategy='mean')  # handle numeric with mean
# data[data_numeric.columns] = handle_missing_values_numeric_mean.fit_transform(data_numeric)
# fit_transform would compute the mean of each numeric column and replace
# missing values with it

Handling with Median


handle_missing_values_numeric_median = SimpleImputer(strategy='median')  # handle numeric with median
data[data_numeric.columns] = handle_missing_values_numeric_median.fit_transform(data_numeric)
# fit_transform computes the median of each numeric column and replaces
# missing values with it
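To see why the median can be preferable to the mean, here is a minimal sketch on a hypothetical 'Fees' column containing one extreme value (all numbers are made up), reusing the imports above:

toy_num = pd.DataFrame({"Fees": [100.0, 110.0, 120.0, np.nan, 10000.0]})  # one extreme value
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy_num)
median_filled = SimpleImputer(strategy="median").fit_transform(toy_num)
print(mean_filled.ravel())    # gap filled with 2582.5: the mean is dragged up by the 10000
print(median_filled.ravel())  # gap filled with 115.0: the median ignores the extreme value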

Display the data after handling missing values:


print("\nData after handling missing values:")
print(data)
Now handling outliers (an outlier is an extreme value that deviates from the general pattern or distribution of the data).
A z-score measures how far a data point is from the mean of a group of data points, expressed in standard deviations.

# Calculate z-scores:
z_scores = zscore(data.select_dtypes(include='number'), axis=0)  # axis=0: compute z-scores along columns

Outliers are data points that are significantly different from the majority of the other data points in a set.
#Identify Outliers:
outliers = (z_scores > 3) | (z_scores < -3)

The mask method is used to replace values in the DataFrame based on a condition.
#Mask Outliers in the DataFrame:
data_no_outliers = data.select_dtypes(include='number').mask(outliers, np.nan)
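Here is a minimal sketch of the masking step end to end, with made-up values (twenty points are used because with only a handful of points no z-score can exceed 3), reusing zscore, pandas, and numpy from above:

toy_fees = pd.DataFrame({"Fees": [100.0] * 19 + [1000.0]})  # 19 typical fees plus one extreme fee
toy_z = zscore(toy_fees, axis=0)                            # the extreme fee gets z of roughly 4.4
toy_outliers = (toy_z > 3) | (toy_z < -3)
print(toy_fees.mask(toy_outliers, np.nan).tail(2))          # the 1000.0 has been replaced by NaN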

Data Transformation: Using a Square Root Transformation


for column in data_no_outliers.columns:
    print(f"\nColumn: {column}")
    print(data_no_outliers[column].head())

Calculate skewness before transformation


skew_before = data_no_outliers['Fees'].skew()
print(f"\nSkewness before transformation: {skew_before}")

Apply square root transformation to the 'Fees' column


# Apply a square root transformation to the 'Fees' column of data_no_outliers
# and store the transformed values in a new column named 'Fees_sqrt'
data_no_outliers['Fees_sqrt'] = np.sqrt(data_no_outliers['Fees'])

Calculate skewness after square root transformation


skew_after_sqrt = data_no_outliers['Fees_sqrt'].skew()
print(f"\nSkewness after square root transformation: {skew_after_sqrt}")

In this case, the skewness before the transformation was -0.5783, indicating the distribution was already slightly left-skewed. After applying the square root transformation, the skewness became more negative (-1.0555), a further shift to the left. This is expected: as the sketch above shows, a square root compresses large values and mainly reduces positive (right) skew, so applied to an already left-skewed variable it pushes the skewness further negative. Reducing skewness is one step towards a more symmetric distribution; a perfectly normal distribution is not always necessary or achievable in practice, but making the distribution more symmetric and closer to normal can be beneficial.
Displaying the transformed data:
print("\nTransformed Data: ")

Transformed Data:
# Focus on the output of Fees and Fees_sqrt: the data is transformed from a
# large number to its square root for easier interpretation and handling
print(data_no_outliers)

Plot histogram and Q-Q plot after square root transformation


# Plot histogram and Q-Q plot after square root transformation
plt.figure(figsize=(12, 6))  # create a new figure, 12 inches wide and 6 inches tall

plt.subplot(1, 2, 1)  # subplot grid with 1 row and 2 columns; select the first (left) subplot
# Histogram of the 'Fees_sqrt' column of data_no_outliers, with a kernel
# density estimate overlaid
sns.histplot(data_no_outliers['Fees_sqrt'], kde=True)
plt.title('Histogram of Square Root-transformed Fees')

plt.subplot(1, 2, 2)  # select the second (right) subplot
probplot(data_no_outliers['Fees_sqrt'], dist="norm", plot=plt)  # use probplot directly
plt.title('Q-Q Plot of Square Root-transformed Fees')
plt.show()
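shapiro was imported above but never used; it could complement the visual Q-Q check with a formal normality test. A minimal sketch, continuing with the document's data_no_outliers frame (dropna() guards against the NaNs introduced when outliers were masked):

stat, p_value = shapiro(data_no_outliers['Fees_sqrt'].dropna())
print(f"Shapiro-Wilk statistic: {stat:.4f}, p-value: {p_value:.4f}")
# A p-value above 0.05 means normality cannot be rejected at the 5% level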
