
EXPERIMENT-2

AIM: To study and apply different data preprocessing techniques.

THEORY:

Data preprocessing is a crucial step in the data analysis and machine learning pipeline. Raw data often contains
missing values, inconsistencies, noise, and irrelevant information, which can negatively impact model performance.
Poor data quality can lead to inaccurate predictions, biased insights, and inefficiencies in analysis. Preprocessing
helps transform raw data into a suitable format for analysis, ensuring better accuracy, efficiency, and robustness of
the model. By cleaning, transforming, and reducing data complexity, preprocessing significantly improves model
training and prediction.

Steps in Data Preprocessing -

1. Data Cleaning

Data cleaning involves handling missing values, duplicate entries, and inconsistencies in the dataset. Since
real-world data is rarely perfect, addressing these issues is essential for meaningful analysis.

a) Handling Missing Values - Missing data can arise due to various reasons such as human errors, equipment
malfunctions, or incomplete data collection. Some common techniques to handle missing values include:

●​ Removing Missing Data: If a small number of records have missing values, they can be removed to
maintain data integrity.
●​ Imputation: Filling in missing values using statistical methods such as mean, median, or mode
replacement.
●​ Using Algorithms That Handle Missing Data: Some machine learning algorithms, like decision trees,
can work with missing values directly.
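
For illustration, a minimal sketch of the drop and imputation options using pandas and scikit-learn's SimpleImputer (the toy column names are assumed here and are not part of the experiment's dataset):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with gaps in both columns
toy = pd.DataFrame({'age': [22, np.nan, 35, 41, np.nan],
                    'salary': [30, 45, 50, np.nan, 38]})

# Option 1: drop rows that contain any missing value
dropped = toy.dropna()

# Option 2: impute with the column mean (median or most_frequent work the same way)
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)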

b) Detecting and Handling Outliers - Outliers are extreme values that differ significantly from the majority of data
points and may skew the results. Techniques to detect and handle outliers include:

● Statistical Methods: Z-score, Interquartile Range (IQR), and Boxplots.
● Transformations: Logarithmic or square root transformations to normalize the data.
●​ Trimming or Capping: Removing or limiting the impact of extreme values.
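
A short sketch of two of these options, IQR-based capping and a log transform, using pandas and NumPy (the toy series is assumed for illustration):

import numpy as np
import pandas as pd

# Hypothetical series with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 400])

# Capping: clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
capped = s.clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)

# Transformation: log1p compresses the range of right-skewed data
log_scaled = np.log1p(s)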

c) Removing Duplicates - Duplicate records can arise due to data entry errors, merging datasets, or repeated data
extraction. Identifying and removing duplicate entries ensures that the dataset remains clean and does not introduce
bias.

2. Data Transformation - Data transformation involves modifying the data to improve its quality and compatibility
with the analysis or machine learning model.

a) Normalization - Normalization rescales the features so they fall within a specific range, such as [0,1] or [-1,1].
This helps improve the performance of distance-based algorithms like k-Nearest Neighbors (k-NN) and Neural
Networks.
b) Standardization - Standardization transforms features so they have a mean of zero and a standard deviation of
one. This is useful for algorithms such as Support Vector Machines (SVM) and Principal Component Analysis
(PCA), which are sensitive to the scale of the input features.
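
A brief sketch contrasting the two rescalings with scikit-learn's MinMaxScaler and StandardScaler (the small array is assumed for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Normalization: rescale each feature to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean and unit standard deviation per feature
X_std = StandardScaler().fit_transform(X)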

3. Feature Selection & Extraction - Feature selection and extraction help improve the efficiency and accuracy of
machine learning models by reducing irrelevant or redundant data.

a) Feature Selection - Feature selection involves selecting only the most relevant features for analysis, reducing
dimensionality and improving model performance. Techniques include:

● Correlation Matrix: Identifies and removes highly correlated features.
● Variance Threshold: Eliminates features with low variance.
●​ Chi-Square Test: Identifies important categorical features.
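
A minimal sketch of the variance-threshold and chi-square approaches with scikit-learn (the toy feature matrix and labels are assumed; the chi-square test requires non-negative features):

import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X = np.array([[0, 2, 1],
              [0, 4, 3],
              [0, 6, 5],
              [0, 8, 7]])
y = np.array([0, 0, 1, 1])

# Drop near-constant features (the zero-variance first column is removed)
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# Keep the k features most associated with the target by the chi-square test
X_chi = SelectKBest(chi2, k=2).fit_transform(X, y)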

b) Feature Extraction - Feature extraction transforms existing features into new dimensions, making the data more
informative while reducing complexity. Examples include:

● Principal Component Analysis (PCA): Projects the data onto a smaller set of uncorrelated components that
capture most of the variance.
●​ Linear Discriminant Analysis (LDA): Maximizes class separability in classification problems.
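
PCA is demonstrated in the CODE section below; as a complementary sketch, LDA on the standard Iris dataset (used here only as a stand-in labeled dataset):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Project the 4 measurements onto 2 axes that maximize separation between the 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)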

4. Data Integration - Data integration involves combining multiple data sources into a unified dataset. It ensures
consistency in format, resolves redundancy, and improves data quality. Common challenges in data integration
include:

● Schema Integration: Aligning different database structures.
● Entity Resolution: Identifying and merging records referring to the same entity.
●​ Data Cleaning During Merging: Handling missing values and inconsistencies across datasets.
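
A small sketch of these steps with pandas (the source tables and column names are hypothetical):

import pandas as pd

# Two sources that name the shared key differently
customers = pd.DataFrame({'cust_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cara']})
orders = pd.DataFrame({'customer': [1, 1, 3], 'amount': [250, 120, 90]})

# Schema integration: align the key column names before merging
orders = orders.rename(columns={'customer': 'cust_id'})

# Entity resolution via a key-based join; a left join keeps customers with no orders
merged = customers.merge(orders, on='cust_id', how='left')

# Cleaning during merging: drop duplicates and fill gaps introduced by the join
merged = merged.drop_duplicates().fillna({'amount': 0})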

5. Data Reduction - Data reduction helps in reducing the volume of data while preserving important information,
which improves computational efficiency and model performance.

a) Dimensionality Reduction - Reducing the number of features while retaining significant information helps
mitigate the curse of dimensionality and enhances model generalization. Techniques include:

●​ PCA (Principal Component Analysis): Reduces correlated features into uncorrelated principal
components.
●​ LDA (Linear Discriminant Analysis): Reduces dimensions while maximizing class separability.

b) Sampling Techniques - Sampling helps create smaller, representative datasets for faster processing and reduced
storage requirements. Common sampling methods include:

● Random Sampling: Selecting a subset of data points randomly.
●​ Stratified Sampling: Ensuring that all data classes are proportionally represented.
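
A short sketch of both sampling methods, using pandas for random sampling and scikit-learn's train_test_split with stratify for stratified sampling (the imbalanced toy labels are assumed):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with an 80/20 class imbalance
toy = pd.DataFrame({'feature': range(100), 'label': [0] * 80 + [1] * 20})

# Random sampling: keep 30% of the rows chosen uniformly at random
random_sample = toy.sample(frac=0.3, random_state=42)

# Stratified sampling: the 30% subset preserves the 80/20 class proportions
_, stratified_sample = train_test_split(toy, test_size=0.3,
                                        stratify=toy['label'], random_state=42)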
CODE:
import pandas as pd
import numpy as np


def remove_outliers(df, columns, method='zscore', threshold=3):
    '''
    Removes outliers from specified columns using the Z-score or IQR method.
    '''
    if method == 'zscore':
        for col in columns:
            mean = df[col].mean()
            std = df[col].std()
            df = df[(df[col] - mean).abs() <= (threshold * std)]
    elif method == 'iqr':
        for col in columns:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            df = df[(df[col] >= (Q1 - 1.5 * IQR)) & (df[col] <= (Q3 + 1.5 * IQR))]
    return df


def remove_duplicates(df):
    '''
    Removes duplicate rows from the dataset.
    '''
    return df.drop_duplicates()


def handle_missing_data(df, method='mean', columns=None):
    '''
    Handles missing values in the dataset using mean, median, mode, or drop method.
    '''
    if method == 'drop':
        df = df.dropna()
    else:
        if columns is None:
            columns = df.columns
        for col in columns:
            if method == 'mean':
                df[col] = df[col].fillna(df[col].mean())
            elif method == 'median':
                df[col] = df[col].fillna(df[col].median())
            elif method == 'mode':
                df[col] = df[col].fillna(df[col].mode()[0])
    return df


# Creating new dataset
data = {
    'X': [10, 20, 30, 40, 50, 60, 700, 80, 90, 100, 110, 120, 800],
    'Y': [5, 10, np.nan, 15, 20, 25, np.nan, 30, 35, 40, 45, 50, 55],
    'Z': ['cat', 'dog', 'cat', 'dog', 'fish', 'cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'fish', 'fish'],
    'W': [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12, 2],
    'V': [500, 1000, 1500, 500, 1000, 1500, 500, 1000, 1500, 500, 1000, 1500, 500]
}

df = pd.DataFrame(data)
print("Original Dataset:\n", df)

# Removing outliers from column 'X'
df = remove_outliers(df, columns=['X'], method='zscore')
print("\nAfter Removing Outliers:\n", df)

# Shuffling the data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print("\nAfter Shuffling Data:\n", df)

# Removing duplicates
df = remove_duplicates(df)
print("\nAfter Removing Duplicates:\n", df)

# Handling missing data in column 'Y'
df = handle_missing_data(df, method='mean', columns=['Y'])
print("\nAfter Handling Missing Data:\n", df)

# Dimensionality Reduction (Example using PCA)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Selecting numerical columns for PCA
numerical_columns = ['X', 'Y', 'W', 'V']
df_numeric = df[numerical_columns]

# Standardizing the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_numeric)

# Applying PCA
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
print("\nAfter Dimensionality Reduction (PCA):\n", df_pca)

# Feature Selection (Example using Correlation Matrix)
# numeric_only=True excludes the categorical column 'Z' from the correlation computation
corr_matrix = df.corr(numeric_only=True)
print("\nFeature Correlation Matrix:\n", corr_matrix)

OUTPUT:
Conclusion -

Data preprocessing is an essential step in the data science workflow. It ensures that data is clean, well-structured,
and suitable for analysis. By handling missing values, detecting outliers, selecting relevant features, and reducing
data complexity, preprocessing enhances the accuracy, efficiency, and robustness of analytical models. Properly
preprocessed data leads to better insights, improved decision-making, and higher-performing machine learning
models.
