
Title of the Experiment: Data Preprocessing and Cleaning

Your Name: AYAN PRAMANICK
UID: TNU2022053200055
Instructor's Name: Dr. Madhu sudan Das
Date of Submission: 2.09.24
Subject Name: Introduction to Machine Learning
Objective of the Experiment:
In this experiment we study data cleaning: how to deal with missing values in a dataset,
how to transform data, and how to scale features. By the end of this experiment, students
should be able to handle missing data in a dataset, apply common transformation
techniques to it, and explain the impact of these different techniques on the dataset and
on model performance.

Introduction:
Background Information:
Data preprocessing is a necessary step before feeding data into a training algorithm; it
aims to improve data quality so that the resulting models perform better. It includes
handling missing values, normalization, feature scaling, and data transformation. Without
it, models built on low-quality data give poor results, whereas preprocessing ensures the
dataset is fit for analysis.

Hypothesis:
Preprocessing will enhance the quality of the data and hence increase the performance of
the machine learning techniques. Different approaches, such as handling missing values
and scaling features, will affect the model in different ways.

Relevance:
Data preprocessing is one of the critical steps in building high-performance machine
learning models. It improves model outcomes by removing inconsistencies and converting
the data into a standard format before it is fed to the models.

Methods:
Assignment 1: Data Cleaning and Handling Missing Values:
1. Load the Dataset
   - Use Python's Pandas library to load a dataset (e.g., a CSV file) that contains
     missing values.
2. Identify Missing Values
3. Handle Missing Values
4. Implement various strategies to handle missing data
   - Remove Rows/Columns with Missing Data
5. Compare Impact
   - See how the dataset changes after handling missing values.
Assignment 2: Data Transformation and Feature Scaling:
1. Load a given dataset containing numerical and categorical features.
2. Perform data transformation tasks such as log transformation, power
transformation, and one-hot encoding.
3. Apply feature scaling techniques like Min-Max Scaling, Standardization, and
Robust Scaling.
4. Evaluate the impact of different transformations and scaling techniques on a
simple machine learning model (e.g., Linear Regression) – not covered.

Assignment 3: Outlier Detection and Treatment


1. Load a given dataset with potential outliers.
2. Use statistical methods (e.g., Z-score, IQR) and visualization techniques (e.g.,
box plots, scatter plots) to identify outliers.
3. Implement different methods to treat outliers (e.g., removal, capping, or
transformation).
4. Analyze how the presence of outliers and their treatment affects data
distribution and model performance.

Assignment 4: Data Imputation and Feature Engineering


1. Load a dataset with missing and incomplete data.
2. Implement advanced data imputation techniques such as K-Nearest
Neighbors (KNN) imputation or Multiple Imputation by Chained Equations
(MICE).
3. Perform feature engineering by creating new features from existing ones (e.g.,
combining features, polynomial features, or interaction terms).
4. Evaluate the impact of the new features on model performance.

Assignment 5: Dimensionality Reduction using PCA


1. Load a high-dimensional dataset.
2. Apply PCA to reduce the number of features while retaining as much variance
as possible.
3. Visualize the principal components and explain the variance retained by each
component.
4. Compare the performance of a machine learning model before and after
applying PCA.

Programming
Assignment 1: Data Cleaning and Handling Missing Values
import pandas as pd
import numpy as np

# Load the dataset (TSLA stock prices) from a CSV file
df = pd.read_csv('/content/TSLA.csv')
print(df)

# Identify missing values per column
missing_values = df.isnull().sum()
print(missing_values)

# Replace missing values in 'Close' with the column mean
df['Close'] = df['Close'].fillna(df['Close'].mean())

# Remove any remaining rows with missing values
df = df.dropna()

# Forward fill missing values in 'Close' (shown for illustration; after the
# steps above this column no longer contains NaNs)
df['Close'] = df['Close'].ffill()
print("\nData after forward filling missing values:")
print(df.head())
Output:
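
Step 5 of Assignment 1 asks to compare how the dataset changes after handling missing
values. A minimal sketch of one way to make that comparison is given below, assuming the
same TSLA.csv file; the raw/cleaned variable names are only illustrative.

import pandas as pd

# Reload the raw data so we can compare it against the cleaned version
raw = pd.read_csv('/content/TSLA.csv')

# Clean a copy: mean-fill 'Close', then drop any remaining incomplete rows
cleaned = raw.copy()
cleaned['Close'] = cleaned['Close'].fillna(cleaned['Close'].mean())
cleaned = cleaned.dropna()

# Compare shapes, missing-value counts, and the mean of 'Close'
print("Rows before/after:", raw.shape[0], "->", cleaned.shape[0])
print("Missing values before:\n", raw.isnull().sum())
print("Missing values after:\n", cleaned.isnull().sum())
print("Mean of 'Close' before/after:", raw['Close'].mean(), "->", cleaned['Close'].mean())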

Assignment 2: Data Transformation and Feature Scaling


import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Load the dataset
df = pd.read_csv('/content/TSLA.csv')
print('Your Dataset', df)

# Log transformation of the 'Close' price
df['Close'] = np.log(df['Close'])

# One-hot encoding (normally applied to categorical features; here each
# distinct 'Close' value becomes its own indicator column)
df = pd.get_dummies(df, columns=['Close'])

# Min-Max Scaling of 'High' to the [0, 1] range
scaler = MinMaxScaler()
df['High'] = scaler.fit_transform(df[['High']])

# Standardization of 'High' (zero mean, unit variance)
scaler = StandardScaler()
df['High'] = scaler.fit_transform(df[['High']])
print("\nData after Standardization:")
print(df.head())
Output:
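
The power transformation and Robust Scaling listed in the Methods, as well as the model
evaluation marked as not covered, do not appear in the code above. A minimal sketch of how
they could be added is shown below, assuming the TSLA.csv file also contains a Volume
column and using 'Close' as an illustrative regression target; these column choices are
assumptions, not part of the original experiment.

import pandas as pd
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv('/content/TSLA.csv').dropna()

# Robust Scaling: uses the median and IQR, so it is less sensitive to outliers
df['High_robust'] = RobustScaler().fit_transform(df[['High']])

# Power transformation (Yeo-Johnson) to make 'Volume' more Gaussian-like
df['Volume_power'] = PowerTransformer().fit_transform(df[['Volume']])

# Evaluate the effect of the transformed features on a simple Linear Regression
# model predicting 'Close' (illustrative choice of target and features)
X = df[['High_robust', 'Volume_power']]
y = df['Close']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))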

Assignment 3: Outlier Detection and Treatment


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/TSLA.csv')

# Identify outliers using the Z-score (|z| > 3 flags a point as an outlier)
z_scores = np.abs((df['High'] - df['High'].mean()) / df['High'].std())
outliers = df[z_scores > 3]
print("Number of outliers detected:", len(outliers))

# Remove the outliers
df = df[z_scores <= 3]

# Visualize the remaining 'High' values using a box plot
plt.boxplot(df['High'])
plt.show()
Output:
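
The Methods for Assignment 3 also mention the IQR rule and capping as a treatment, which
the code above does not cover. A minimal sketch of IQR-based detection and capping on the
same 'High' column is given below; the capped column name is only illustrative.

import pandas as pd

df = pd.read_csv('/content/TSLA.csv')

# IQR-based outlier bounds for the 'High' column
q1, q3 = df['High'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR bounds:", lower, upper)
print("Outliers by IQR rule:", ((df['High'] < lower) | (df['High'] > upper)).sum())

# Capping (winsorizing): clip values to the IQR bounds instead of removing rows
df['High_capped'] = df['High'].clip(lower=lower, upper=upper)
print(df[['High', 'High_capped']].describe())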

Assignment 4: Data Imputation and Feature Engineering


import pandas as pd
from sklearn.impute import KNNImputer

# Load the dataset
df = pd.read_csv('/content/TSLA.csv')
print(df.head())

# KNN imputation: fill missing 'High' values from the nearest neighbours
imputer = KNNImputer()
df['High'] = imputer.fit_transform(df[['High']]).ravel()

# Feature engineering: create a new feature by combining existing features
df['new_feature'] = df['High'] + df['Close']
print("\nData with new feature:")
print(df.head())
Output:
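
The Methods for Assignment 4 also mention MICE imputation and polynomial or interaction
features, which the code above does not cover. A minimal sketch using scikit-learn's
IterativeImputer (a MICE-style imputer) and PolynomialFeatures is given below; that
TSLA.csv has Open, High, Low and Close columns is an assumption made for illustration.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv('/content/TSLA.csv')

# MICE-style imputation: each column is modelled from the others iteratively
num_cols = ['Open', 'High', 'Low', 'Close']  # assumed numeric columns
df[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])

# Polynomial and interaction features derived from 'High' and 'Close'
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['High', 'Close']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['High', 'Close']))
print(poly_df.head())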

Assignment 5: Dimensionality Reduction using PCA


# Import necessary libraries
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Apply PCA to reduce the number of features
pca = PCA(n_components=2)
pca_df = pca.fit_transform(df.drop('target', axis=1))
pca_df = pd.DataFrame(data=pca_df, columns=['PC1', 'PC2'])
pca_df['target'] = df['target']

# Visualize the principal components, one scatter per class
plt.figure(figsize=(8, 6))
for target in pca_df['target'].unique():
    plt.scatter(pca_df.loc[pca_df['target'] == target, 'PC1'],
                pca_df.loc[pca_df['target'] == target, 'PC2'])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Principal Components')
plt.show()

# Print the proportion of variance explained by each principal component
print(pca.explained_variance_ratio_)
Output:
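
Step 4 of Assignment 5 asks to compare model performance before and after applying PCA,
which the code above does not do. A minimal sketch of such a comparison on the same Iris
dataset is shown below; the choice of Logistic Regression as the classifier is an
assumption made for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Baseline: logistic regression on all four original features
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy without PCA:", accuracy_score(y_test, baseline.predict(X_test)))

# Same model on the two principal components (PCA fitted on training data only)
pca = PCA(n_components=2).fit(X_train)
reduced = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)
print("Accuracy with PCA:", accuracy_score(y_test, reduced.predict(pca.transform(X_test))))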
