Dimensionality Reduction Algorithms
Submitted By:
P.V.S.S.K KASHYAP
18891A05A5
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
(NBA Accredited & Affiliated to Jawaharlal Nehru Technological University, Hyderabad)
Deshmukhi (V), Pochampally (M), Yadadri-Bhuvanagiri District, Telangana - 508284
Vision
To emerge as a premier center for education and research in computer science and engineering
and in transforming students into innovative professionals of contemporary and future technologies to
cater to the global needs of human resources for IT and ITES companies.
Mission
To produce excellent computer science professionals by imparting quality training,
hands-on-experience and value based education.
To strengthen links with industry through collaborative partnerships in research &
product development and student internships.
To promote research based projects and activities among the students in the emerging
areas of technology.
To explore opportunities for skill development in the application of computer science
among rural and underprivileged population.
Program Educational Objectives
● To create and sustain a community of learning in which students acquire knowledge and
apply in their concerned fields with due consideration for ethical, ecological, and
economic issues.
● To provide knowledge based services so as to meet the needs of the society and industry.
● To make the students understand, design and implement the concepts in multiple arenas.
● To educate the students in disseminating the research findings with good soft skills so as
to become successful entrepreneurs.
PREFACE
I have tried my best to elucidate all the details relevant to the topic of this report. In the beginning, I have tried to give a general view of the topic.
Table of Contents
I. ACKNOWLEDGEMENT
II. CERTIFICATE
1. MACHINE LEARNING
2. TYPES OF MACHINE LEARNING
3. WORKING
4. USAGE
6. PRINCIPAL COMPONENT ANALYSIS
7. INDEPENDENT COMPONENT ANALYSIS
8. METHODS USED ON PROJECTIONS
9. t-DISTRIBUTED STOCHASTIC NEIGHBOR EMBEDDING (t-SNE)
10. APPLICATIONS
11. CONCLUSION
12. FUTURE SCOPE
13. REFERENCES
ACKNOWLEDGEMENT
I would like to thank the respected Head of the Department for giving me such a wonderful opportunity to expand my knowledge in my own branch and for giving me guidelines to present a seminar report. It helped me a lot to realize what we study for.
I would like to thank the respected Technical Seminar Coordinator for organizing the seminars with continuous support.
Secondly, I would like to thank my parents who patiently helped me as I went through my work
and helped to modify and eliminate some of the irrelevant or unnecessary stuff.
Thirdly, I would like to thank my friends who helped me to make my work more organized and
well-stacked till the end.
Last but not least, I thank the Almighty for giving me the strength to complete my report on time.
CERTIFICATE
Technical Seminar Coordinator          Head of the Department
1. MACHINE LEARNING
Machine learning is a branch of artificial intelligence in which systems learn from data to improve their predictions without being explicitly programmed. Recommendation engines are a common use case for machine learning. Other popular uses include fraud detection, spam filtering, malware threat detection, business process automation (BPA) and predictive maintenance.
2. TYPES OF MACHINE LEARNING
Classical machine learning is often categorized by how an algorithm learns to become more
accurate in its predictions. There are four basic approaches: supervised learning, unsupervised
learning, semi-supervised learning and reinforcement learning. The type of algorithm data
scientists choose to use depends on what type of data they want to predict.
Supervised learning: In this type of machine learning, data scientists supply algorithms with labelled training data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified.
Unsupervised learning: This type of machine learning involves algorithms that train on unlabeled data. The algorithm scans through data sets looking for any meaningful connections; neither the groupings it finds nor the predictions or recommendations it outputs are predetermined.
Semi-supervised learning: This approach to machine learning involves a mix of the two
preceding types. Data scientists may feed an algorithm mostly labeled training data, but the
model is free to explore the data on its own and develop its own understanding of the data set.
Reinforcement learning: Data scientists typically use reinforcement learning to teach a machine
to complete a multi-step process for which there are clearly defined rules. Data scientists
program an algorithm to complete a task and give it positive or negative cues as it works out how to do so. But for the most part, the algorithm decides on its own what steps to take along the way.
3. WORKING
Semi-supervised learning works by feeding a small amount of labelled training data to an algorithm. From this, the algorithm learns patterns in the data set, which it can then apply to new, unlabeled data. The performance of algorithms typically improves when they
train on labelled data sets. But labelling data can be time consuming and expensive. Semi-
supervised learning strikes a middle ground between the performance of supervised learning and
the efficiency of unsupervised learning. Some areas where semi-supervised learning is used
include:
Machine translation: Teaching algorithms to translate language based on less than a full
dictionary of words.
Fraud detection: Identifying cases of fraud when you only have a few positive examples.
Labelling data: Algorithms trained on small data sets can learn to apply data labels to larger sets
automatically.
4. USAGE
Facebook uses machine learning to personalize how each member's feed is delivered. If a
member frequently stops to read a particular group's posts, the recommendation engine will start
to show more of that group's activity earlier in the feed.
Behind the scenes, the engine is attempting to reinforce known patterns in the member's online
behaviors. Should the member change patterns and fail to read posts from that group in the
coming weeks, the news feed will adjust accordingly.
In addition to recommendation engines, other uses for machine learning include the following:
Customer relationship management. CRM software can use machine learning models to analyse
email and prompt sales team members to respond to the most important messages first. More
advanced systems can even recommend potentially effective responses.
Business intelligence. BI and analytics vendors use machine learning in their software to identify
potentially important data points, patterns of data points and anomalies.
Human resource information systems. HRIS systems can use machine learning models to filter
through applications and identify the best candidates for an open position.
Self-driving cars. Machine learning algorithms can even make it possible for a semi-autonomous
car to recognize a partially visible object and alert the driver.
Virtual assistants. Smart assistants typically combine supervised and unsupervised machine
learning models to interpret natural speech and supply context.
When it comes to advantages, machine learning can help enterprises understand their customers
at a deeper level. By collecting customer data and correlating it with behaviors over time,
machine learning algorithms can learn associations and help teams tailor product development
and marketing initiatives to customer demand.
Some companies use machine learning as a primary driver in their business models. Uber, for
example, uses algorithms to match drivers with riders. Google uses machine learning to surface
the right advertisements in searches.
But machine learning comes with disadvantages. First and foremost, it can be expensive.
Machine learning projects are typically driven by data scientists, who command high salaries.
These projects also require software infrastructure that can be expensive.
There is also the problem of machine learning bias. Algorithms trained on data sets that exclude
certain populations or contain errors can lead to inaccurate models of the world that, at best, fail
and, at worst, are discriminatory. When an enterprise bases core business processes on biased models, it can suffer regulatory and reputational harm.
How to choose the right machine learning model
The process of choosing the right machine learning model to solve a problem can be time
consuming if not approached strategically.
Step 1: Align the problem with potential data inputs that should be considered for the solution.
This step requires help from data scientists and experts who have a deep understanding of the
problem.
Step 2: Collect data, format it and label the data if necessary. This step is typically led by data
scientists, with help from data wranglers.
Step 3: Choose which algorithm(s) to use and test to see how well they perform. This step is
usually carried out by data scientists.
Step 4: Continue to fine-tune outputs until they reach an acceptable level of accuracy. This step
is usually carried out by data scientists with feedback from experts who have a deep
understanding of the problem.
Complex models can produce accurate predictions, but explaining to a lay person how an output
was determined can be difficult.
While machine learning algorithms have been around for decades, they've attained new
popularity as artificial intelligence has grown in prominence. Deep learning models, in
particular, power today's most advanced AI applications.
Machine learning platforms are among enterprise technology's most competitive realms, with
most major vendors, including Amazon, Google, Microsoft, IBM and others, racing to sign
customers up for platform services that cover the spectrum of machine learning activities,
including data collection, data preparation, data classification, model building, training and
application deployment.
Continued research into deep learning and AI is increasingly focused on developing more
general applications. Today's AI models require extensive training in order to produce an
algorithm that is highly optimized to perform one task. But some researchers are exploring ways
to make models more flexible and are seeking techniques that allow a machine to apply context
learned from one task to future, different tasks.
6. Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is one of the popular tools used for exploratory data analysis and predictive modeling. It is a technique for extracting the strong patterns in a dataset by reducing a large set of variables to a smaller set of components that still capture most of the variance.
Correlation: It signifies how strongly two variables are related to each other.
Orthogonal: It means that the variables are not correlated with each other, and hence the correlation between such a pair of variables is zero.
Covariance Matrix: A matrix containing the covariance between the pair of variables is called
the Covariance Matrix.
1. Getting the dataset: we need to take the input dataset and divide it into two subparts X
and Y, where X is the training set, and Y is the validation set.
2. Representing data into a structure: In the 2nd step we will represent our dataset into a
structure.
3. Standardizing the data: In this step, we will standardize our dataset so that each column is on a comparable scale; otherwise, the features with high variance would dominate the features with lower variance. The standardized matrix is denoted Z.
4. Calculating the covariance of Z: To calculate the covariance of Z, we take the matrix Z, transpose it, and multiply the transpose by Z. The output matrix is the covariance matrix of Z.
5. Calculating the eigenvalues and eigenvectors: We then calculate the eigenvalues and eigenvectors of the resultant covariance matrix of Z.
6. Sorting the eigenvectors: We sort the eigenvalues in decreasing order and arrange the corresponding eigenvectors as the columns of a matrix P*.
7. Calculating the new features, or principal components: Here we calculate the new features. To do this, we multiply the P* matrix by Z. In the resultant matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are independent of each other.
8. Removing less important features from the new dataset: Once the new feature set is obtained, we decide what to keep and what to remove; only the relevant or important features are kept in the new dataset, and the unimportant features are dropped. A minimal code sketch of these steps is given below.
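The steps above can be illustrated with a short NumPy sketch. This is a minimal example with a randomly generated dataset and two retained components, both of which are illustrative assumptions rather than part of the method itself; in practice the library routine sklearn.decomposition.PCA performs the same computation.

import numpy as np

X = np.random.rand(100, 5)                    # example dataset: 100 observations, 5 features
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # step 3: standardize the data
cov = Z.T @ Z / (len(Z) - 1)                  # step 4: covariance matrix of Z
eig_vals, eig_vecs = np.linalg.eigh(cov)      # step 5: eigenvalues and eigenvectors
order = np.argsort(eig_vals)[::-1]            # step 6: sort by decreasing eigenvalue
P_star = eig_vecs[:, order[:2]]               # keep the top two eigenvectors as P*
Z_star = Z @ P_star                           # step 7: project the data onto the principal components
print(Z_star.shape)                           # (100, 2): the reduced dataset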
Applications of PCA
PCA is commonly used for exploratory visualization of high-dimensional data, image compression, noise reduction, and as a preprocessing step that speeds up other machine learning algorithms.
7. Independent Component Analysis
Working of ICA
The standard problem used to describe ICA is the “Cocktail Party Problem”. In its simplest form, imagine two people having a conversation at a cocktail party. For whatever reason, you have two microphones placed near the two party-goers. Both voices are heard by both microphones at different volumes based on the distance between each person and each microphone. In other words, we record
two files that include audio from the two party-goers mixed together. The problem then is, how
can we separate them?
This problem is solved easily with Independent Component Analysis (ICA), which transforms a set of vectors into a maximally independent set. Returning to our “Cocktail Party Problem”, ICA will convert the two mixed audio recordings into two unmixed recordings, one for each individual speaker. Notice that the number of inputs and outputs is the same, and since the outputs are mutually independent, there is no obvious way to drop components as in Principal Component Analysis (PCA).
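As a rough illustration of this idea, the following sketch mixes two synthetic signals and then separates them with scikit-learn's FastICA. The signals, the mixing matrix and the parameter values are made-up assumptions used only for demonstration.

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                             # first "speaker"
s2 = np.sign(np.sin(3 * t))                    # second "speaker"
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 1.0]])         # assumed mixing at the two microphones
X = S @ A.T                                    # the two mixed recordings

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                   # recovered, maximally independent sources
print(S_est.shape)                             # (2000, 2): one column per separated speaker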
Applications of ICA
Image processing: As a method that recognizes and separates the hidden factors in multivariate signals, ICA has significantly influenced the field of image processing.
Image de-noising: Using different methods, ICA can remove much of the noise that an image accumulates while being captured, thereby enhancing the image quality.
Handling incomplete data: Missing data limits how effective PCA can be, whereas ICA can still be applied in such cases; it can also be seen as one of the data mining tools for handling incomplete data.
8. Methods used on projections
Long Short-Term Memory (LSTM)
ARIMA
Comparing Models
Long Short-Term Memory (LSTM): LSTM is a type of recurrent neural network that is
particularly useful for making predictions with sequential data.
ARIMA: The ARIMA model looks slightly different from the models above. We use the statsmodels SARIMAX class to train the model and generate dynamic predictions. The SARIMA model breaks down into a few parts: seasonal and non-seasonal autoregressive, differencing and moving-average terms. A hedged sketch is shown below.
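A minimal sketch of this workflow follows, assuming statsmodels is available; the synthetic monthly series and the (1, 1, 1)(1, 1, 1, 12) orders are illustrative assumptions, not values from the report.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# example monthly series with a yearly seasonal pattern
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
y = pd.Series(10 + np.sin(np.arange(96) * 2 * np.pi / 12) + np.random.normal(scale=0.3, size=96), index=idx)

results = sm.tsa.SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
pred = results.get_prediction(start=idx[-24], dynamic=True)   # dynamic predictions for the last two years
print(pred.predicted_mean.head())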
Comparing Models: To compare model performance, we will look at root mean squared error
(RMSE) and mean absolute error (MAE). These measurements are both commonly used for
comparing model performance, but they have slightly different intuition and mathematical
meaning.
MAE: the mean absolute error tells us, on average, how far our predictions are from the true values.
RMSE: we calculate RMSE by taking the square root of the mean of all of the squared errors. Both measures are illustrated below.
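As a quick illustration, the two metrics can be computed directly with NumPy; the arrays below are made-up example values.

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))             # average absolute distance from the true values
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # square root of the mean squared error
print(mae, rmse)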
9. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Applies a non-linear dimensionality reduction technique where the focus is on keeping the
very similar data points close together in lower-dimensional space.
Preserves the local structure of the data, using a Student's t-distribution to compute the similarity between two points in the lower-dimensional space.
Working:
Step 1: Find the pairwise similarity between nearby points in a high dimensional space.
Step 2: Map each point in high dimensional space to a low dimensional map based on the
pairwise similarity of points in the high dimensional space.
Step 3: Find a low-dimensional data representation that minimizes the mismatch between the high-dimensional similarities pᵢⱼ and the low-dimensional similarities qᵢⱼ, using gradient descent on the Kullback-Leibler divergence (KL divergence). A short usage sketch is given below.
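A minimal usage sketch with scikit-learn's TSNE follows; the digits dataset and the perplexity value are illustrative choices, not part of the algorithm's definition.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                 # 64-dimensional inputs
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                                   # (1797, 2): the low-dimensional map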
UMAP: UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based on Riemannian geometry and algebraic topology. The result is a practical, scalable algorithm that applies to real-world data.
FACTS:
In the comparison summarized here, UMAP outperformed t-SNE and PCA: in the 2D and 3D plots, the mini-clusters are separated well. UMAP is very effective for visualizing clusters or groups of data points and their relative proximities.
UMAP is faster than t-SNE when there is:
a large number of data points,
a number of embedding dimensions greater than 2 or 3,
a large number of ambient dimensions in the data set.
A short usage sketch with the umap-learn package is given below.
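The sketch below assumes the umap-learn package is installed (imported as umap); the digits dataset and the n_neighbors and min_dist values are illustrative assumptions.

import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
embedding = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(X)
print(embedding.shape)                              # (1797, 2)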
10. APPLICATIONS
The purpose of this section is to provide a complete and simplified explanation of Principal Component Analysis (PCA) and of the related techniques that follow. We'll cover how PCA works step by step, so everyone can understand it and make use of it, even those without a strong mathematical background. PCA is a widely covered method on the web, and there are some great articles about it, but many spend too much time in the weeds, when most of us just want to know how it works in a simplified way. Principal component analysis can be broken down into a handful of steps, with logical explanations of what PCA is doing and with mathematical concepts such as standardization, covariance, eigenvectors and eigenvalues kept simple, without focusing on how to compute them. Principal Component Analysis, or PCA, is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms, without extraneous variables to process. So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.
The following dimensionality reduction and feature selection techniques are applied widely in practice:
Backward Elimination
Backward elimination is a feature selection technique used while building a machine learning model. It removes those features that do not have a significant effect on the dependent variable or on the prediction of the output. There are various ways to build a model in machine learning, including All-in, Backward Elimination, Forward Selection, Bidirectional Elimination and Score Comparison.
Backward elimination is a stepwise regression approach that begins with a full (saturated) model and at each step gradually eliminates variables from the regression model to find a reduced model that best explains the data. It is also known as backward elimination regression.
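A minimal sketch of backward elimination based on OLS p-values is given below, assuming a pandas DataFrame X of features, a Series y of targets and a 0.05 significance level; these names and the threshold are assumptions for illustration.

import statsmodels.api as sm

def backward_elimination(X, y, significance=0.05):
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop("const")       # p-value of each remaining feature
        worst = pvalues.idxmax()
        if pvalues[worst] > significance:
            features.remove(worst)                  # drop the least significant feature
        else:
            break
    return features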
Missing Value Ratio
The overall percentage of data that is missing is important. Generally, if less than 5% of values are missing, then it is acceptable to ignore them (REF). However, the overall percentage missing alone is not enough; you also need to pay attention to which data is missing.
Missing data is defined as values that are not stored (or not present) for some variables in the given dataset. For example, in the Titanic dataset, the columns 'Age' and 'Cabin' have some missing values.
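The per-column missing-value ratio can be checked with pandas, as in the sketch below; the file name titanic.csv and the 50% threshold for dropping a column are illustrative assumptions.

import pandas as pd

df = pd.read_csv("titanic.csv")                          # assumed file
missing_ratio = df.isnull().mean() * 100                 # percentage of missing values per column
print(missing_ratio.sort_values(ascending=False))
cols_to_drop = missing_ratio[missing_ratio > 50].index   # e.g. drop columns that are more than half empty
df_reduced = df.drop(columns=cols_to_drop)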
What is variance in machine learning? Variance refers to the changes in the model when using
different portions of the training data set. Simply stated, variance is the variability in the model
prediction—how much the ML function can adjust depending on the given data set.
The Low Variance Filter is a useful dimensionality reduction technique. Variance is a statistical measure of the amount of variation in a given variable. If the variance of a variable is too low, it means that the variable does not change much and hence it can usually be ignored.
Why do we use a low variance filter?
It filters out numeric columns whose variance is below a user-defined threshold. Columns with low variance are likely to distract certain learning algorithms (in particular those which are distance based) and are therefore better removed.
A small variance indicates that the data points tend to be very close to the mean, and to each other, while a large variance indicates that the data points are spread far from the mean and from one another. A variance of zero indicates that all values within a set of numbers are identical. Variance is the average of the squared distances from each point to the mean.
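The filter can be sketched with scikit-learn's VarianceThreshold, as below; the random data, the constant column and the 0.01 threshold are illustrative assumptions.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.random.rand(100, 10)
X[:, 0] = 0.5                                   # a constant column: variance zero, no information
print(X.var(axis=0))                            # variance = mean of squared distances from the column mean
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)           # columns below the variance threshold are removed
print(X.shape, "->", X_reduced.shape)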
High Correlation Filter: This dimensionality reduction technique tries to discard inputs that are very similar to others. In simple words, if your opinion is the same as your boss's, one of you is not required. If the values of two input parameters are always the same, they represent the same entity, so we do not need both parameters; just one is enough. In technical terms, if there is a very high correlation between two input variables, we can safely drop one of them. A pair of variables having high correlation increases multicollinearity in the dataset, so we can use this technique to find highly correlated features and drop them accordingly.
The corr() method in pandas can be used to identify the correlation between fields. Before we start, we have to choose only the numeric fields, as the corr() method works only with numeric fields; non-numeric fields can also be highly correlated, but this method cannot measure that.
High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance). We can calculate the correlation between the independent numerical variables, and if the correlation coefficient crosses a certain threshold value, we can drop one of the variables (dropping a variable is highly subjective and should always be done keeping the domain in mind). A small sketch of this filter is given below.
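The following pandas sketch drops one variable from each highly correlated pair; the synthetic DataFrame and the 0.9 threshold are assumptions made for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a,
                   "b": 2 * a + rng.normal(scale=0.01, size=200),    # nearly a copy of "a"
                   "c": rng.normal(size=200)})

corr = df.corr().abs()                                               # corr() works only on numeric fields
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))    # upper triangle, excluding the diagonal
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()] # here, "b" is dropped
df_reduced = df.drop(columns=to_drop)
print(df_reduced.columns.tolist())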
Random Forest
Random Forest is one of the most widely used algorithms for feature selection. It comes packaged with built-in feature importance, so you do not need to program that separately, and this helps us select a smaller subset of features. We need to convert the data into numeric form by applying one-hot encoding, as the scikit-learn implementation of Random Forest takes only numeric inputs. In the example dataset, we also drop the ID variables (Item_Identifier and Outlet_Identifier), as these are just unique identifiers and hold no significant importance for us currently.
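A hedged sketch of this procedure follows. The file name train.csv, the target column Item_Outlet_Sales and the decision to drop rows with missing values are assumptions standing in for the example dataset described above.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("train.csv")                                      # assumed file
df = df.drop(columns=["Item_Identifier", "Outlet_Identifier"])     # drop the ID variables
df = df.dropna()                                                   # for simplicity, drop rows with missing values
X = pd.get_dummies(df.drop(columns=["Item_Outlet_Sales"]))         # one-hot encode the categorical inputs
y = df["Item_Outlet_Sales"]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))           # candidates for the reduced feature set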
Random forest is a supervised machine learning algorithm that is widely used for classification and regression problems, for example classifying whether an email is "spam" or "not spam". It builds decision trees on different samples of the data and takes their majority vote for classification, or their average in the case of regression. A random forest addresses the limitations of a single decision tree: it reduces overfitting and increases precision. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
Factor Analysis
Factor analysis is an unsupervised machine learning algorithm used for dimensionality reduction. This algorithm creates factors from the observed variables to represent the common variance, i.e. the variance due to correlation among the observed variables.
Factor analysis is a powerful data reduction technique that enables researchers to investigate
concepts that cannot easily be measured directly. By boiling down a large number of variables
into a handful of comprehensible underlying factors, factor analysis results in easy-to-
understand, actionable data.
PCA, short for Principal Component Analysis, and Factor Analysis are two statistical methods that are often covered together in classes on multivariate statistics.
There are two types of factor analysis: exploratory and confirmatory. Exploratory factor analysis (EFA) is a method to explore the underlying structure of a set of observed variables, and is a crucial step in the scale development process. The purpose of factor analysis is to reduce many
individual items into a fewer number of dimensions. Factor analysis can be used to simplify data,
such as reducing the number of variables in regression models. The overall objective of factor
analysis is data summarization and data reduction. A central aim of factor analysis is the orderly
simplification of a number of interrelated measures. Factor analysis describes the data using
many fewer dimensions than original variables.
Factor analysis is used to identify "factors" that explain a variety of results on different tests. For
example, intelligence research found that people who get a high score on a test of verbal ability
are also good on other tests that require verbal abilities.
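A minimal factor-analysis sketch with scikit-learn is shown below; the iris data and the choice of two latent factors are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)               # each observation expressed in terms of two latent factors
print(fa.components_.shape)                   # (2, 4): loadings of each original variable on each factor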
Auto-Encoder
An autoencoder is an unsupervised learning technique for neural networks that learns efficient
data representations (encoding) by training the network to ignore signal “noise.” Autoencoders
can be used for image denoising, image compression, and, in some cases, even generation of
image data.
An autoencoder is a type of neural network that can be used to learn a compressed representation of raw data. The encoder compresses the input and the decoder attempts to recreate the input from the compressed version provided by the encoder. After training, the encoder model is saved and the decoder is discarded.
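A small dense autoencoder can be sketched in Keras as below; the layer sizes, training settings and the random input array are arbitrary assumptions rather than a prescribed architecture.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 20)                                   # example data with 20 input features

inputs = keras.Input(shape=(20,))
encoded = layers.Dense(8, activation="relu")(inputs)           # compressed 8-dimensional representation
decoded = layers.Dense(20, activation="sigmoid")(encoded)      # attempt to reconstruct the input

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)                         # kept after training; the decoder is discarded
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)     # the network learns to reproduce its input

X_reduced = encoder.predict(X)                                 # 8-dimensional encoding of the data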
Forward Selection
Forward selection is a type of stepwise regression which begins with an empty model and adds variables one by one. In each forward step, you add the one variable that gives the single best improvement to your model. It is one of two commonly used methods of stepwise regression; the other is backward elimination, which is almost its opposite: there, you start with a model that includes every possible variable and eliminate the extraneous variables one by one.
Forward selection typically begins with only an intercept. One tests the various variables that may be relevant, and the 'best' variable, where 'best' is determined by some pre-determined criterion, is added to the model. As the model continues to improve (per that same criterion), we continue the process, adding one variable at a time and testing at each step. Once the model no longer improves with the addition of more variables, the process stops. The criteria used to decide which variable goes in next vary: you could be looking for the lowest score under cross-validation, the lowest p-value, or any of a number of other tests or measures of accuracy. Since stepwise regression tends toward over-fitting, which happens when we put in more variables than is actually good for the model, it typically shows a very close, neat fit of the data used in regression, but the model will be far off from additional data points and not good for interpolation. Therefore, it is usually good to have strict criteria for adding in any variables.
Forward selection is an iterative method in which we start with no features in the model. In each iteration, we keep adding the feature that best improves the model, until the addition of a new variable no longer improves its performance.
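Forward selection is available in scikit-learn as SequentialFeatureSelector with direction="forward" (scikit-learn 0.24 or later); the estimator, the dataset and the number of features to select in the sketch below are illustrative assumptions.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                     direction="forward", cv=5)
selector.fit(X, y)
print(selector.get_support())                  # boolean mask of the selected features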
Score comparison
In machine learning, scoring is the process of applying an algorithmic model built from a historical dataset to a new dataset in order to uncover practical insights that will help solve a business problem.
The primary objective of model comparison and selection is better performance of the machine learning software or solution. The goal is to narrow down the best algorithms that suit both the data and the business requirements. A hedged comparison sketch is given below.
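Two candidate models can be compared by cross-validated RMSE, as in the sketch below; the dataset, the pair of models and the scoring choice are illustrative assumptions.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    scores = -cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(type(model).__name__, scores.mean())   # lower average RMSE is better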
11. CONCLUSION
Dimensionality reduction reduces the number of input variables in a dataset while preserving as much information as possible. Techniques such as PCA, ICA, t-SNE, UMAP, factor analysis and autoencoders, together with feature selection methods such as backward elimination, forward selection, the missing value ratio, the low variance filter, the high correlation filter and Random Forest feature importance, make data sets easier to explore, visualize and model, at the cost of a small loss of accuracy. The right technique depends on the data, the model and the business requirements.
12. FUTURE SCOPE
For applications that aim to be more efficient and more accurate, a wide range of methodologies is available among these algorithms. As the basic drawbacks of the algorithms, such as the curse of dimensionality, are addressed, dimensionality reduction will deliver better accuracy and better performance, and its applications will be viewed in a much broader way.
13. REFERENCES
• https://www.geeksforgeeks.org/
• https://www.guru99.com/
• https://www.edureka.co/
• https://www.javatpoint.com/
• https://www.sciencedirect.com/
• https://www.tutorialspoint.com/