
Title of the Experiment: Data Preprocessing and Cleaning

Your Name: AYAN PRAMANICK
UID: TNU2022053200055
Instructor's Name: Dr. Madhu sudan Das
Date of Submission: 2.09.24
Subject Name: Introduction to Machine Learning
Objective of the Experiment:
In this experiment we study data cleaning: how to deal with missing values in a dataset,
how to transform data, and how to scale features. By the end of this experiment, students
should be able to handle missing data in a dataset, apply common transformation
techniques to it, and explain the impact of these different techniques on the dataset and
on model performance.

Introduction:
Background Information:
Data preprocessing is a necessary step before feeding data into a training algorithm; it
aims to improve data quality so that the resulting models perform better. It includes
handling missing values, normalization, feature scaling, and data transformation. Without
it, models built on low-quality data give poor results, whereas preprocessing ensures the
dataset is fit for analysis.

Hypothesis:
Preprocessing will enhance the quality of the data and hence increase the performance of
the machine learning techniques. Different approaches, such as handling missing values
and scaling features, will affect the model in different ways.

Relevance:
Data preprocessing is one of the critical steps in building high-performance machine
learning models. It improves model outcomes by removing inconsistencies and converting
the data into a standard format before it is fed to the models.

Methods:
Assignment 1: Data Cleaning and Handling Missing Values:
1. Load the Dataset
   - Use Python's Pandas library to load a dataset (e.g., a CSV file) that contains
     missing values.
2. Identify Missing Values
3. Handle Missing Values
4. Implement various strategies to handle missing data
   - Remove Rows/Columns with Missing Data
5. Compare Impact
   - See how the dataset changes after handling missing values.
Assignment 2: Data Transformation and Feature Scaling:
1. Load a given dataset containing numerical and categorical features.
2. Perform data transformation tasks such as log transformation, power
transformation, and one-hot encoding.
3. Apply feature scaling techniques like Min-Max Scaling, Standardization, and
Robust Scaling.
4. Evaluate the impact of different transformations and scaling techniques on a
simple machine learning model (e.g., Linear Regression) – not covered.

Assignment 3: Outlier Detection and Treatment


1. Load a given dataset with potential outliers.
2. Use statistical methods (e.g., Z-score, IQR) and visualization techniques (e.g.,
box plots, scatter plots) to identify outliers.
3. Implement different methods to treat outliers (e.g., removal, capping, or
transformation).
4. Analyze how the presence of outliers and their treatment affects data
distribution and model performance.

Assignment 4: Data Imputation and Feature Engineering


1. Load a dataset with missing and incomplete data.
2. Implement advanced data imputation techniques such as K-Nearest
Neighbors (KNN) imputation or Multiple Imputation by Chained Equations
(MICE).
3. Perform feature engineering by creating new features from existing ones (e.g.,
combining features, polynomial features, or interaction terms).
4. Evaluate the impact of the new features on model performance.

Assignment 5: Dimensionality Reduction using PCA


1. Load a high-dimensional dataset.
2. Apply PCA to reduce the number of features while retaining as much variance
as possible.
3. Visualize the principal components and explain the variance retained by each
component.
4. Compare the performance of a machine learning model before and after
applying PCA.

Programming
Assignment 1: Data Cleaning and Handling Missing Values
import pandas as pd
import numpy as np

# Load the dataset (TSLA stock prices) from a CSV file
df = pd.read_csv('/content/TSLA.csv')
print(df)

# Identify missing values per column
missing_values = df.isnull().sum()
print(missing_values)

# Replace missing values in 'Close' with the column mean
df['Close'] = df['Close'].fillna(df['Close'].mean())

# Remove any remaining rows with missing values
df = df.dropna()

# Forward fill missing values in 'Close' (shown for illustration; after the
# steps above this column no longer contains NaNs)
df['Close'] = df['Close'].ffill()
print("\nData after forward filling missing values:")
print(df.head())
Output:
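
Step 5 of Assignment 1 asks to compare how the dataset changes after handling missing
values. A minimal sketch of one way to make that comparison is given below, assuming the
same TSLA.csv file; the raw/cleaned variable names are only illustrative.

import pandas as pd

# Reload the raw data so we can compare it against the cleaned version
raw = pd.read_csv('/content/TSLA.csv')

# Clean a copy: mean-fill 'Close', then drop any remaining incomplete rows
cleaned = raw.copy()
cleaned['Close'] = cleaned['Close'].fillna(cleaned['Close'].mean())
cleaned = cleaned.dropna()

# Compare shapes, missing-value counts, and the mean of 'Close'
print("Rows before/after:", raw.shape[0], "->", cleaned.shape[0])
print("Missing values before:\n", raw.isnull().sum())
print("Missing values after:\n", cleaned.isnull().sum())
print("Mean of 'Close' before/after:", raw['Close'].mean(), "->", cleaned['Close'].mean())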

Assignment 2: Data Transformation and Feature Scaling


import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Load the dataset
df = pd.read_csv('/content/TSLA.csv')
print('Your Dataset', df)

# Log transformation of the 'Close' price
df['Close'] = np.log(df['Close'])

# One-hot encoding (normally applied to categorical features; here each
# distinct 'Close' value becomes its own indicator column)
df = pd.get_dummies(df, columns=['Close'])

# Min-Max Scaling of 'High' to the [0, 1] range
scaler = MinMaxScaler()
df['High'] = scaler.fit_transform(df[['High']])

# Standardization of 'High' (zero mean, unit variance)
scaler = StandardScaler()
df['High'] = scaler.fit_transform(df[['High']])
print("\nData after Standardization:")
print(df.head())
Output:
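
The power transformation and Robust Scaling listed in the Methods, as well as the model
evaluation marked as not covered, do not appear in the code above. A minimal sketch of how
they could be added is shown below, assuming the TSLA.csv file also contains a Volume
column and using 'Close' as an illustrative regression target; these column choices are
assumptions, not part of the original experiment.

import pandas as pd
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv('/content/TSLA.csv').dropna()

# Robust Scaling: uses the median and IQR, so it is less sensitive to outliers
df['High_robust'] = RobustScaler().fit_transform(df[['High']])

# Power transformation (Yeo-Johnson) to make 'Volume' more Gaussian-like
df['Volume_power'] = PowerTransformer().fit_transform(df[['Volume']])

# Evaluate the effect of the transformed features on a simple Linear Regression
# model predicting 'Close' (illustrative choice of target and features)
X = df[['High_robust', 'Volume_power']]
y = df['Close']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))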

Assignment 3: Outlier Detection and Treatment


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('/content/TSLA.csv')

# Identify outliers using the Z-score (|z| > 3 flags a point as an outlier)
z_scores = np.abs((df['High'] - df['High'].mean()) / df['High'].std())
outliers = df[z_scores > 3]
print("Number of outliers detected:", len(outliers))

# Remove the outliers
df = df[z_scores <= 3]

# Visualize the remaining 'High' values using a box plot
plt.boxplot(df['High'])
plt.show()
Output:
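
The Methods for Assignment 3 also mention the IQR rule and capping as a treatment, which
the code above does not cover. A minimal sketch of IQR-based detection and capping on the
same 'High' column is given below; the capped column name is only illustrative.

import pandas as pd

df = pd.read_csv('/content/TSLA.csv')

# IQR-based outlier bounds for the 'High' column
q1, q3 = df['High'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR bounds:", lower, upper)
print("Outliers by IQR rule:", ((df['High'] < lower) | (df['High'] > upper)).sum())

# Capping (winsorizing): clip values to the IQR bounds instead of removing rows
df['High_capped'] = df['High'].clip(lower=lower, upper=upper)
print(df[['High', 'High_capped']].describe())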

Assignment 4: Data Imputation and Feature Engineering


import pandas as pd
from sklearn.impute import KNNImputer

# Load the dataset
df = pd.read_csv('/content/TSLA.csv')
print(df.head())

# KNN imputation: fill missing 'High' values from the nearest neighbours
imputer = KNNImputer()
df['High'] = imputer.fit_transform(df[['High']]).ravel()

# Feature engineering: create a new feature by combining existing features
df['new_feature'] = df['High'] + df['Close']
print("\nData with new feature:")
print(df.head())
Output:
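
The Methods for Assignment 4 also mention MICE imputation and polynomial or interaction
features, which the code above does not cover. A minimal sketch using scikit-learn's
IterativeImputer (a MICE-style imputer) and PolynomialFeatures is given below; that
TSLA.csv has Open, High, Low and Close columns is an assumption made for illustration.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv('/content/TSLA.csv')

# MICE-style imputation: each column is modelled from the others iteratively
num_cols = ['Open', 'High', 'Low', 'Close']  # assumed numeric columns
df[num_cols] = IterativeImputer(random_state=0).fit_transform(df[num_cols])

# Polynomial and interaction features derived from 'High' and 'Close'
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['High', 'Close']])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['High', 'Close']))
print(poly_df.head())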

Assignment 5: Dimensionality Reduction using PCA


# Import necessary libraries
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Apply PCA to reduce the number of features
pca = PCA(n_components=2)
pca_df = pca.fit_transform(df.drop('target', axis=1))
pca_df = pd.DataFrame(data=pca_df, columns=['PC1', 'PC2'])
pca_df['target'] = df['target']

# Visualize the principal components, one scatter per class
plt.figure(figsize=(8, 6))
for target in pca_df['target'].unique():
    plt.scatter(pca_df.loc[pca_df['target'] == target, 'PC1'],
                pca_df.loc[pca_df['target'] == target, 'PC2'])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Principal Components')
plt.show()

# Print the proportion of variance explained by each principal component
print(pca.explained_variance_ratio_)
Output:
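
Step 4 of Assignment 5 asks to compare model performance before and after applying PCA,
which the code above does not do. A minimal sketch of such a comparison on the same Iris
dataset is shown below; the choice of Logistic Regression as the classifier is an
assumption made for illustration.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Baseline: logistic regression on all four original features
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy without PCA:", accuracy_score(y_test, baseline.predict(X_test)))

# Same model on the two principal components (PCA fitted on training data only)
pca = PCA(n_components=2).fit(X_train)
reduced = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)
print("Accuracy with PCA:", accuracy_score(y_test, reduced.predict(pca.transform(X_test))))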
