
BHARATIYA VIDYA BHAVAN’S

SARDAR PATEL INSTITUTE OF TECHNOLOGY


MUNSHI NAGAR, ANDHERI (WEST), MUMBAI – 400 058.
Computer Engineering
Python Programming for Data Science
A. Y. 2024-25

Experiment 7

Aim: Implementation of feature reduction techniques.

Theory:

Introduction to Feature Reduction Techniques:

Feature reduction techniques are essential tools in machine learning and data science, used to
simplify models by reducing the number of input variables while preserving the most relevant
information. These methods enhance computational efficiency, minimize the risk of overfitting,
and improve the interpretability of models. By focusing on the most significant features, these
techniques allow models to perform better, especially when dealing with high-dimensional
datasets. Common approaches include Principal Component Analysis (PCA), Linear
Discriminant Analysis (LDA), and feature selection methods like Recursive Feature
Elimination (RFE). These techniques are widely used across various domains to enhance model
performance and reduce complexity.

Key Features of Feature Reduction Techniques Include:

- Dimensionality Reduction: Reduces the number of features in high-dimensional datasets while preserving the most informative ones, which improves model accuracy and efficiency.

- Mitigating Overfitting: Simplifying models by reducing features can prevent overfitting, especially in cases where the model has more features than data points.

- Improved Interpretability: With fewer features, it becomes easier to understand and interpret
the patterns in the data.

- Faster Computation: Fewer features lead to quicker training and prediction times, which is
critical for large datasets or complex models.

Key Feature Reduction Techniques:

1. Principal Component Analysis (PCA):

- PCA is a linear technique that reduces the dimensionality of the data by finding a new set of
orthogonal features (principal components) that capture the most variance in the data.

- Use case: Dimensionality reduction for visualization, noise reduction, and improving model
efficiency.



2. Linear Discriminant Analysis (LDA):

- LDA reduces the feature space by focusing on maximizing the separation between different
classes. It finds a lower-dimensional space that best separates the target classes.

- Use case: Commonly used for supervised classification tasks with multiple classes.

3. Singular Value Decomposition (SVD):

- It is a powerful technique in linear algebra that is used for dimensionality reduction, data
compression, and feature extraction. It decomposes a matrix into three other matrices, revealing
its fundamental structure. In machine learning, SVD is often applied to reduce the number of
features in a dataset while preserving important information.

How SVD Works:

Given a matrix A of shape m×n, SVD decomposes it into three matrices:

A = UΣVᵀ

Where:

● U: An m×m orthogonal matrix. Each column of U is a left singular vector.
● Σ: An m×n diagonal matrix whose diagonal entries, called singular values, indicate the importance (or magnitude) of the corresponding singular vectors.
● Vᵀ: The transpose of an n×n orthogonal matrix V. Each row of Vᵀ (each column of V) is a right singular vector.
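
As an illustration of this decomposition, the short sketch below applies NumPy's np.linalg.svd to a small random matrix; the matrix values and shapes are arbitrary choices for demonstration, not part of the lab dataset.

# Illustrative sketch: decomposing a small matrix with NumPy's SVD
import numpy as np

A = np.random.rand(4, 3)  # a 4x3 matrix (m=4, n=3), arbitrary data

# full_matrices=False returns the compact form (U is 4x3, S has 3 values, Vt is 3x3);
# full_matrices=True would return the full m×m U and n×n V described above
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstructing A from the three factors: A = U @ diag(S) @ Vt
A_reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, A_reconstructed))  # True: the factorization recovers A

# Keeping only the largest singular value gives a rank-1 approximation of A
A_rank1 = U[:, :1] @ np.diag(S[:1]) @ Vt[:1, :]
print(S)  # singular values, in descending order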

Basic Syntax and Operations for Feature Reduction:

1. Principal Component Analysis (PCA):

# Importing necessary libraries
from sklearn.decomposition import PCA  # PCA for feature reduction
import numpy as np  # NumPy for handling data arrays

# Sample data: 100 samples, each with 5 features
# Generating random data with 100 rows and 5 columns
data = np.random.rand(100, 5)

# Applying PCA to reduce the number of features to 2
# PCA will find the two principal components that capture the most variance in the data
pca = PCA(n_components=2)

# Transforming the data by projecting it onto the two principal components
reduced_data = pca.fit_transform(data)

# Output the explained variance ratio of the two principal components
# This shows how much variance in the data is explained by each of the two components
print(pca.explained_variance_ratio_)



2. Linear Discriminant Analysis (LDA):

# Importing necessary libraries
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA  # LDA for dimensionality reduction
from sklearn.datasets import load_iris  # Loading a sample dataset (Iris dataset)

# Loading the Iris dataset
# X represents the features (sepal length, sepal width, petal length, petal width)
# y represents the target labels (species of iris flowers)
iris = load_iris()
X, y = iris.data, iris.target

# Applying LDA to reduce the data to 2 components
# LDA will find two linear combinations of the features that best separate the classes
lda = LDA(n_components=2)

# Transforming the data by projecting it onto the two linear discriminants
lda_data = lda.fit_transform(X, y)

# Displaying the reduced data
# This will print the transformed data with two components for each sample
print(lda_data)

3. Singular Value Decomposition (SVD):

# Importing necessary libraries
from sklearn.decomposition import TruncatedSVD  # SVD for dimensionality reduction
import numpy as np  # NumPy for handling data arrays

# Sample data: 100 samples, each with 5 features
# Generating random data with 100 rows and 5 columns
data = np.random.rand(100, 5)

# Applying SVD to reduce the number of features to 2
# TruncatedSVD is used here to compute a low-rank approximation of the data
svd = TruncatedSVD(n_components=2)

# Transforming the data by projecting it onto the two singular vectors
reduced_data = svd.fit_transform(data)

# Output the explained variance ratio
# This shows how much variance in the data is captured by each of the selected components
print(svd.explained_variance_ratio_)

Overfitting and Underfitting: A Detailed Explanation

In machine learning, the goal is to build models that generalize well to unseen data. However, if
a model is too complex or too simple, it may suffer from overfitting or underfitting, which are
two common pitfalls in model training. Understanding these concepts is key to achieving a
balance between a model’s performance on the training data and its ability to generalize to new
data.



1. Overfitting

Definition:

Overfitting occurs when a model is too complex and learns not only the underlying patterns in
the training data but also the noise and irrelevant details. As a result, the model performs very
well on the training data but poorly on unseen test data. This happens because the model fits
the data too closely, capturing random fluctuations that do not represent the actual data
distribution.

Causes of Overfitting:

- Too many features: If there are more features than necessary, the model may pick up on
irrelevant relationships in the training data.

- High model complexity: Complex models (like deep neural networks or very deep decision
trees) have a large number of parameters, which can cause them to memorize the training data.

- Insufficient training data: When the dataset is too small, the model has fewer examples to
learn from, leading it to capture noise or spurious patterns.

- Excessive training: Training a model for too many epochs can lead to overfitting, as the
model starts learning minor variations in the data instead of general patterns.

Symptoms of Overfitting:

- High training accuracy, low test accuracy: The model shows high performance on training
data but performs poorly on validation/test data.

- Large gap between training and validation error: The training error keeps decreasing, but the
validation error stops decreasing (or starts increasing).

Example of Overfitting:

Imagine fitting a highly complex curve (with many polynomial terms) to data points that follow
a simpler underlying trend. The model fits every data point perfectly, including noise or
outliers, but fails to predict new data points accurately.
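
The sketch below illustrates this behaviour with scikit-learn; the synthetic data, noise level, and polynomial degrees are arbitrary choices for illustration, not part of the lab exercise.

# Illustrative sketch: a high-degree polynomial overfitting noisy data from a simple trend
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X.ravel() + rng.normal(scale=0.5, size=60)  # simple linear trend plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a simple model (degree 1) with a very flexible one (degree 15);
# the flexible model typically reaches a lower training error but a higher test error
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),  # training MSE
          mean_squared_error(y_test, model.predict(X_test)))    # test MSE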

How to Prevent Overfitting:

- Cross-validation: Use techniques like k-fold cross-validation to ensure that the model
generalizes well.

- Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add penalties to the model's complexity to prevent overfitting (illustrated in the sketch after this list).

- Reduce model complexity: Simplify the model by using fewer features or parameters (e.g.,
pruning a decision tree).

- Early stopping: Stop the training process when the validation error starts increasing, even if
the training error keeps decreasing.

- Ensemble methods: Use techniques like bagging (e.g., Random Forests) or boosting (e.g.,
XGBoost) to combine multiple models and reduce overfitting.

- Data augmentation: In cases where data is limited, artificially increase the training set by
creating variations of the existing data (commonly used in image recognition tasks).
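
A minimal sketch of the cross-validation and L2-regularization ideas mentioned above; the synthetic data and the Ridge alpha value are arbitrary illustrative choices.

# Illustrative sketch: 5-fold cross-validation comparing an unregularized linear model
# with an L2-regularized (Ridge) model on overfitting-prone data
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(50, 40)                     # few samples, many features: prone to overfitting
y = X[:, 0] + 0.1 * rng.normal(size=50)  # target depends mainly on the first feature

# Mean R^2 score over 5 folds for each model
print(cross_val_score(LinearRegression(), X, y, cv=5).mean())
print(cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean())

On data like this, with far more features than informative signal, the penalized model usually obtains the higher cross-validated score, showing how regularization improves generalization.
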
2. Underfitting

Definition:

Underfitting occurs when a model is too simple and fails to capture the underlying patterns in
the training data. The model performs poorly on both training and test data because it cannot
represent the data well. This typically happens when the model is not complex enough to
account for the variability in the data.

Causes of Underfitting:

- Too few features: The model doesn’t have enough features to capture the relationships in the
data.

- High bias: Simple models, like linear regression for complex data, make strong assumptions
about the data (e.g., linearity) and may not capture more intricate patterns.

- Undertrained model: The model hasn’t been trained long enough to learn from the data, or the
algorithm used may not be appropriate for the complexity of the problem.

- Wrong choice of model: Using a model that is inherently too simple for the task, such as
applying linear regression to data that requires a non-linear model.

Symptoms of Underfitting:

- High training error and high test error: Both the training and test errors are large because the
model is too simple to learn the data’s structure.

- Low variance, high bias: The model has little flexibility to adapt to the data and hence has a
strong bias (predefined assumption about the data).

Example of Underfitting:

Suppose you have data that follows a non-linear pattern (like a curve), but you fit a straight line
(linear regression) to it. The model will not capture the underlying trend and will perform
poorly even on the training data.

How to Prevent Underfitting:

- Increase model complexity: Use a more complex model or add more features to the model (e.g., use polynomial regression instead of linear regression for non-linear data), as shown in the sketch after this list.

- Remove bias: Try using models with lower bias, such as decision trees or neural networks,
which have more flexibility in fitting the data.

- Train longer: Ensure that the model is trained sufficiently, allowing it to learn the patterns in
the data.

- Feature engineering: Create new features or transformations of existing ones to better represent the underlying data patterns.
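
A minimal sketch of the polynomial-regression remedy mentioned above; the quadratic data and the degree used are arbitrary illustrative choices.

# Illustrative sketch: adding polynomial features to fix an underfit linear model
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=100)  # non-linear (quadratic) pattern

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# R^2 on the training data itself: the straight line cannot capture the curve,
# while the degree-2 model fits it well
print(linear.score(X, y))
print(poly.score(X, y))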

3. Balancing Overfitting and Underfitting (Bias-Variance Tradeoff)

To achieve good model performance, we need to find the right balance between overfitting and
underfitting. This balance is often referred to as the bias-variance tradeoff:

- High bias (underfitting) occurs when the model is too simplistic and makes strong assumptions about the data (e.g., assuming a linear relationship when the data is non-linear).

- High variance (overfitting) occurs when the model is too flexible and captures noise in the
data, leading to poor generalization to new data.

The goal is to build a model that has low bias and low variance by tuning the complexity of the
model and using techniques like regularization, cross-validation, or ensemble learning to ensure
good generalization.

Visualization of Overfitting and Underfitting

Imagine plotting a model's performance as it grows in complexity:

1. Underfitting: The model is too simple and cannot capture patterns in the data.

2. Good fit: The model is just complex enough to capture the patterns without fitting noise.

3. Overfitting: The model is too complex and fits the training data perfectly, but performs
poorly on new data.

The ideal model is somewhere in between, where it is complex enough to learn the data but not
so complex that it learns noise.
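
This progression can be traced numerically with scikit-learn's validation_curve, as in the sketch below; the sine-shaped data and the range of polynomial degrees are arbitrary illustrative choices.

# Illustrative sketch: training vs cross-validation score as model complexity
# (polynomial degree) grows
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=80)

degrees = np.arange(1, 13)
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",  # parameter of the PolynomialFeatures step
    param_range=degrees, cv=5)

# Mean scores per degree: low degrees tend to underfit (both scores low), very high
# degrees tend to overfit (training score high, validation score falls behind)
for d, tr, va in zip(degrees, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(d, round(tr, 3), round(va, 3))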

Conclusion: This lab manual provides a thorough introduction to key techniques in feature reduction, including methods like PCA, LDA, and SVD. It covers fundamental
concepts, such as reducing dimensionality while preserving essential data patterns,
optimizing model performance, and addressing the curse of dimensionality. Through
practical examples and exercises, you will gain a strong understanding of how to
effectively apply these techniques to simplify datasets without losing important
information. This knowledge will equip you with the skills to enhance model efficiency,
improve interpretability, and tackle complex problems in data science, machine
learning, and business analytics.

** In conclusion, students are expected to write their own individual learnings from the said
experiments.
