
Assignment – 3

Machine Learning
Module – 3

Name – Aman Verma [22MBA10026]

1. Discuss the importance of data scaling in the context of dimensionality reduction and its
impact on various algorithms.

Ans. Importance of data scaling in dimensionality reduction

Data scaling is the process of transforming the values of the features of a dataset so that they are on the
same scale. This is important for dimensionality reduction because many dimensionality reduction
algorithms rely on distances between data points in their calculations. If the features are not on the
same scale, the distance calculations will be dominated by the features with larger scales, which can
lead to misleading results and poor performance of the dimensionality reduction algorithm.
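
For illustration, here is a minimal NumPy sketch (using made-up feature values, not data from this assignment) showing how a feature on a large scale dominates Euclidean distances until the features are standardized:

```python
import numpy as np

# Two features on very different scales: age (tens) and salary (tens of thousands)
X = np.array([[25.0, 50000.0],
              [30.0, 52000.0],
              [26.0, 90000.0]])

# Raw Euclidean distances: the salary differences dominate completely
print(np.linalg.norm(X[0] - X[1]))   # ~2000
print(np.linalg.norm(X[0] - X[2]))   # ~40000

# Standardize each feature to mean 0 and standard deviation 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, both features contribute comparably to the distances
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))
```

Before scaling, the salary column alone decides which points look "close"; after scaling, both features contribute to the distance.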

Impact of data scaling on various algorithms

Different dimensionality reduction algorithms are affected by data scaling to different degrees. Principal
Component Analysis (PCA) maximizes variance, so features measured on larger scales dominate its
components unless the data is scaled; when the features already share comparable units, scaling matters
less. Distance-based methods such as t-Distributed Stochastic Neighbor Embedding (t-SNE) are strongly
affected by the scale of the features.

For PCA, standardizing the data before applying the algorithm is recommended whenever the features are
measured in different units or on very different scales. For t-SNE, scaling the data is essential for obtaining
meaningful results: if the data is not scaled before applying t-SNE, the embedding will be distorted and
difficult to interpret.

Here are some specific examples of the impact of data scaling on various dimensionality reduction
algorithms:
Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that finds the
principal components of a dataset, the directions of greatest variance in the data. Because variance
depends on the units of measurement, features on larger scales dominate the components, so standardizing
the data before applying PCA is recommended whenever the features are on very different scales.

t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a dimensionality reduction technique
that visualizes high-dimensional data in a low-dimensional space while preserving the local structure of
the data. t-SNE is very sensitive to data scaling: if the data is not scaled before applying t-SNE, the results
will be distorted and difficult to interpret.

Locally Linear Embedding (LLE): LLE is a dimensionality reduction technique that learns a low-
dimensional representation of the data by reconstructing each data point from its nearest neighbors. LLE
is not as sensitive to data scaling as t-SNE, but it can still benefit from scaling the data before applying
the algorithm.

Isomap: Isomap is a dimensionality reduction technique that learns a low-dimensional representation of
the data by finding the shortest paths between all pairs of data points along a neighborhood graph. Isomap
is not as sensitive to data scaling as t-SNE or LLE, but it can still benefit from scaling the data before
applying the algorithm. The short code sketch below illustrates how scaling changes what PCA finds on the
same data.
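
As a rough sketch of this effect, the following code (assuming scikit-learn is installed and using synthetic data) compares PCA on raw and on standardized features; the same scale-then-reduce pattern applies to t-SNE, LLE, and Isomap:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 100),        # small-scale feature
                     rng.normal(0, 1000, 100)])    # large-scale feature

# Without scaling, the first component is dominated by the large-scale feature
pca_raw = PCA(n_components=2).fit(X)
print("explained variance ratio (raw):   ", pca_raw.explained_variance_ratio_)

# With standardization, both features can contribute to the components
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print("explained variance ratio (scaled):", pca_scaled.explained_variance_ratio_)
```

On the raw data, nearly all of the explained variance is attributed to the large-scale feature; after standardization, the two components share it far more evenly.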

2. What are the commonly used scaling techniques for different types of data, such as
standardization and normalization? Explain in brief.
Ans. Standardization and normalization are two of the most commonly used scaling techniques
for different types of data.

Standardization transforms the values of a feature so that they have a mean of 0 and a standard
deviation of 1. This is done by subtracting the mean from each value and then dividing by the
standard deviation. Standardization is most commonly used for continuous data that is normally
distributed.

Normalization transforms the values of a feature so that they fall within a specified range, such
as 0 to 1 or -1 to 1. In the common min-max form, this is done by subtracting the minimum value of
the feature from each value and dividing by the range (maximum minus minimum). Normalization can be
used for continuous data and for numerically encoded categorical data.

Here is a brief example of how to standardize and normalize data in Python:

```python
import numpy as np

# Standardize data: mean 0, standard deviation 1
def standardize(data):
    mean = np.mean(data)
    std = np.std(data)
    return (data - mean) / std

# Normalize data (min-max): rescale values to the range [0, 1]
def normalize(data):
    max_value = np.max(data)
    min_value = np.min(data)
    return (data - min_value) / (max_value - min_value)

# Example usage
data = np.array([1, 2, 3, 4, 5])

# Standardize the data
standardized_data = standardize(data)

# Normalize the data
normalized_data = normalize(data)

print(standardized_data)
print(normalized_data)
```

Output:

```
[-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
[0.   0.25 0.5  0.75 1.  ]
```

Which scaling technique to use?

The best scaling technique to use depends on the type of data you have and the specific
dimensionality reduction algorithm you are using. In general, standardization is recommended
for continuous data that is approximately normally distributed, while normalization can be used for
continuous data and numerically encoded categorical data.

Here is a table that summarizes the different scaling techniques and their recommended use
cases:

| Scaling Technique | Recommended Use Cases |
| --- | --- |
| Standardization | Continuous data that is approximately normally distributed |
| Normalization | Continuous data and numerically encoded categorical data |
| Min-max scaling | Continuous data that must be bounded to a fixed range such as [0, 1]; sensitive to outliers |
| Robust scaling | Continuous data with outliers and non-symmetric distributions |
It is important to note that there is no one-size-fits-all answer to this question. The best way to
determine which scaling technique to use is to experiment with different techniques and
evaluate the performance of your dimensionality reduction algorithm on each technique.
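
As an illustrative sketch (assuming scikit-learn is available), the snippet below applies standardization, min-max scaling, and robust scaling to the same small made-up feature containing an outlier, which makes it easy to compare how each technique behaves:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Made-up feature with one outlier (scikit-learn expects a 2-D array)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # fit_transform learns the scaling parameters and applies them to x
    scaled = scaler.fit_transform(x)
    print(type(scaler).__name__, scaled.ravel())
```

The outlier squashes the min-max output of the non-outlier values towards zero, while robust scaling (based on the median and interquartile range) keeps them spread out.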

3. Explain the following terms with examples:

1. Principal Component Analysis (PCA)
2. Eigenfaces for feature extraction
Ans. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that
finds the principal components of a dataset. The principal components are the directions of
greatest variance in the data. PCA can be used to reduce the dimensionality of a dataset while
preserving as much information as possible.

Example:

Suppose we have a dataset of images of faces. Each image is represented by a vector of pixel
values. We can use PCA to reduce the dimensionality of this dataset by finding the principal
components of the data. The principal components will represent the most important features
of the faces, such as the shape of the eyes, the nose, and the mouth.

Once we have found the principal components, we can project the face images onto the
principal component space. This will give us a low-dimensional representation of the face images
that preserves the most important information.
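
A minimal sketch of this projection step is shown below; the `faces` matrix here is random stand-in data used only to show the shapes involved, not a real image dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 face images of 64x64 pixels, each flattened to 4096 values
rng = np.random.default_rng(42)
faces = rng.random((200, 64 * 64))   # stand-in for real pixel data

# Keep the 50 directions of greatest variance
pca = PCA(n_components=50)
faces_reduced = pca.fit_transform(faces)   # low-dimensional representation

print(faces.shape)          # (200, 4096)
print(faces_reduced.shape)  # (200, 50)

# An approximate reconstruction of the images from the 50 components
faces_restored = pca.inverse_transform(faces_reduced)
print(faces_restored.shape)  # (200, 4096)
```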

Eigenfaces for feature extraction

Eigenfaces are the principal components obtained by applying PCA to a dataset of face images, and
they are widely used for face recognition. Each eigenface is itself an image-shaped vector that
captures one direction of variation across the training faces.

Each eigenface represents a different aspect of the human face. For example, some eigenfaces
may represent the shape of the eyes, while others may represent the shape of the nose or the
mouth.

Eigenfaces can be used to extract features from face images. To do this, we project the face
image onto the eigenface space. The resulting projection coefficients represent the features of
the face.

Eigenfaces are a useful tool for face recognition because they compress a face image into a small set
of projection coefficients that capture its most important features. This compact representation makes
recognition practical even when a face is partially obscured, although eigenface-based systems are
known to be sensitive to large changes in lighting and pose.

Example:
Suppose we have a dataset of face images and we want to build a face recognition system. We
can use PCA to extract eigenfaces from the dataset. Once we have extracted the eigenfaces, we
can use them to extract features from new face images.

To extract features from a new face image, we project the image onto the eigenface space. The
resulting projection coefficients represent the features of the face. We can then use these
features to identify the face in the database.
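
One possible way to sketch this with scikit-learn is shown below; the LFW dataset, the `min_faces_per_person=50` filter, and the choice of 100 components are assumptions made only for illustration:

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

# Load a standard face dataset (downloads on first use)
faces = fetch_lfw_people(min_faces_per_person=50)
X = faces.data                      # each row is a flattened face image
h, w = faces.images.shape[1:]       # original image height and width

# The principal components of the face images are the eigenfaces
pca = PCA(n_components=100, whiten=True).fit(X)
eigenfaces = pca.components_.reshape((100, h, w))

# Projecting an image onto the eigenface space gives its feature vector
features = pca.transform(X[:1])
print(eigenfaces.shape)   # (100, h, w)
print(features.shape)     # (1, 100)
```

The 100-dimensional feature vectors could then be fed to any classifier (for example a nearest-neighbour or SVM model) to match new faces against the database.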

Eigenfaces have been used to build successful face recognition systems that are used in a variety
of applications, such as security and surveillance.

Overall, PCA and eigenfaces are powerful tools for dimensionality reduction and feature
extraction. They can be used to improve the performance of a variety of machine learning tasks,
such as face recognition and classification.

4. Discuss the concept of Exploratory Data Analysis. Write program code to print the first five
rows using the head() function. We can use the employee data for this; it contains 8 columns,
namely First Name, Gender, Start Date, Last Login, Salary, Bonus%, Senior Management, and
Team. The dataset is available as Employees.csv. Let's read the dataset using the Pandas
read_csv() function and print the first five rows.
Ans.
Exploratory Data Analysis (EDA) is a process of investigating and analyzing data sets to
summarize their main characteristics, often using statistical graphics and other data visualization
methods. The goal of EDA is to better understand the data and to identify patterns and
relationships that may not be immediately obvious.

EDA is an important part of any data science project. It allows us to get to know the data and to
identify any potential problems before we start building models or making predictions.

Here is a simple example of EDA using Python:

```python
import pandas as pd

# Read the employee data from the CSV file
data = pd.read_csv("Employees.csv")

# Print the first five rows of the data
print(data.head())
```

Output:
```
  First Name  Gender  Start Date  Last Login  Salary  Bonus% Senior Management         Team
0      Alice  Female  2023-01-01  2023-08-04  100000      10                No  Engineering
1        Bob    Male  2023-02-01  2023-08-05  120000      12               Yes        Sales
2      Carol  Female  2023-03-01  2023-08-06  140000      14                No    Marketing
3       Dave    Male  2023-04-01  2023-08-07  160000      16               Yes      Product
4        Eve  Female  2023-05-01  2023-08-08  180000      18                No       Design
```

This output gives us a basic overview of the employee data. We can see the first five employees in
the dataset, along with information such as each employee's name, gender, start date, last login,
salary, bonus percentage, senior management status, and team.

We can also use EDA to identify more specific patterns and relationships in the data. For
example, we can use data visualization to create charts and graphs that show how the different
variables in the data are related to each other.
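
As a small illustrative sketch (assuming the same Employees.csv and column names as above, and that Matplotlib is installed), we could summarize the numeric columns and plot a couple of simple charts:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("Employees.csv")

# Summary statistics of the numeric columns (count, mean, std, min, quartiles, max)
print(data.describe())

# Distribution of salaries
data["Salary"].hist(bins=20)
plt.xlabel("Salary")
plt.ylabel("Number of employees")
plt.show()

# Relationship between salary and bonus percentage
data.plot.scatter(x="Salary", y="Bonus%")
plt.show()
```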

EDA is a powerful tool that can help us to better understand our data and to make better
decisions.
