Dimensionality Reduction

Dimensionality reduction is essential for effective data analysis as it addresses the challenges posed by high-dimensional data, such as decreased accuracy and increased computational complexity. It can be achieved through feature selection and feature extraction methods, which help in reducing the number of features while retaining the most informative ones. Genetic algorithms are also highlighted as a powerful optimization technique for navigating large search spaces in feature selection.

DIMENSIONALITY REDUCTION
WHY DIMENSIONALITY REDUCTION?
Generally, it is easy and convenient to collect data
An experiment

Data accumulates at an unprecedented speed


Data preprocessing is an important part of effective pattern recognition
Dimensionality reduction is an effective approach to downsizing data

2
WHY DIMENSIONALITY REDUCTION?
Most machine learning/pattern recognition techniques may not be
effective for high-dimensional data
Curse of Dimensionality
Accuracy and efficiency may degrade rapidly as the dimension increases.

The intrinsic dimension may be small.


For example, the number of genes responsible for a certain type of disease
may be small

3
WHY DIMENSIONALITY REDUCTION?
Visualization: projection of high-dimensional data onto 2D or
3D.

Data compression: efficient storage and retrieval.

Noise removal: positive effect on query accuracy.

4
DOCUMENT CLASSIFICATION
Documents (e.g. web pages, emails) are represented by the terms they contain:

            T1    T2   ……   TN    C
D1          12     0   ……    6    Sport
D2           3    10   ……   28    Travel
…
DM           0    11   ……   16    Jobs

Document sources: the Internet, digital libraries (ACM Portal, IEEE Xplore, PubMed)

■ Task: to classify unlabeled documents into categories
■ Challenge: thousands of terms
■ Solution: to apply dimensionality reduction
5
OTHER EXAMPLES

Face images Handwritten digits


6
DIMENSIONALITY REDUCTION
Reduces time complexity: Less computation

Reduces space complexity: Fewer parameters

Saves the cost of observing/computing the feature

7
DIMENSIONALITY REDUCTION
Key methods of dimensionality reduction

Feature Selection Feature Extraction

8
FEATURE SELECTION VS EXTRACTION
Feature selection:

Choosing k < d important features, ignoring the remaining d − k
These are subset selection algorithms

Feature extraction:

Project the original x_i, i = 1, …, d dimensions onto new k < d dimensions z_j, j = 1, …, k

9
FEATURE SELECTION

Choosing an optimal subset of features


(Subset of m out of n)

10
FEATURE EXTRACTION

Mapping of the original high-dimensional


data onto a lower-dimensional space
11
FEATURE EXTRACTION
Given a set of data points of p variables

Compute their low-dimensional representation:

12
Feature Selection

13
CONTENTS: FEATURE SELECTION
Introduction

Feature subset search

Models for Feature Selection


Filters

Wrappers
Genetic Algorithm

14
INTRODUCTION
You have some data, and you want to use it to build
a classifier, so that you can predict something (e.g.
likelihood of cancer)

The data has 10,000 fields (features)


You need to cut it down to 1,000 fields before
you try machine learning. Which 1,000?

This process of choosing the 1,000 fields to use is an example of Feature Selection
17
DATA SETS WITH MANY FEATURES
• Gene expression datasets (~10,000 features)
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds

• Proteomics data (~20,000 features)


• http://www.ebi.ac.uk/pride/

18
FEATURE SELECTION: WHY?

Source: http://elpub.scix.net/data/works/att/02-28.content.pdf

19
FEATURE SELECTION: WHY?
It is quite easy to find many more cases in the literature where experiments show that accuracy drops when you use more features

Questions?
Why does accuracy reduce with more features?
How does it depend on the specific choice of features?
What else changes if we use more features?
So, how do we choose the right features?

20
WHY ACCURACY REDUCES ?
Note: Suppose the best feature set has 20
features. If you add another 5 features,
typically the accuracy of machine learning
may reduce. But you still have the original
20 features!! Why does this happen???

21
NOISE/EXPLOSION
The additional features typically add noise

Machine learning will pick up on spurious correlations that might hold in the training set but not in the test set

For some ML methods, more features means more parameters to


learn (more NN weights, more decision tree nodes, etc…) – the
increased space of possibilities is more difficult to search

22
FEATURE SUBSET SEARCH
(Scatter-plot illustrations)

X2 is important, X1 is not
X1 is important, X2 is not
X1 and X2 are important, X3 is not

23
SUBSET SEARCH PROBLEM
An example of search space (Kohavi & John 1997)

Forward Backward

24
DIFFERENT ASPECTS OF SEARCH
Search starting points
⚫ Empty set
⚫ Full set
⚫ Random point

Search directions
⚫ Sequential forward selection
⚫ Sequential backward elimination
⚫ Bidirectional generation
⚫ Random generation

Search Strategies
⚫ Exhaustive/Complete
⚫ Heuristics

25
EXHAUSTIVE SEARCH
Original dataset has N features
You want to use a subset of k features
A complete FS method means: try every subset of k features, and
choose the best!
The number of subsets is N! / k!(N−k)!
What is this when N is 100 and k is 5?
75,287,520

What is this when N is 10,000 and k is 100?

Around 6.5 × 10^241

For comparison, there are only about 10^80 atoms in the observable universe
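
A quick way to check these counts (a minimal sketch; math.comb is Python's exact binomial coefficient):

from math import comb

print(comb(100, 5))                 # 75287520 subsets of 5 out of 100
print(len(str(comb(10_000, 100))))  # 242 digits, i.e. about 6.5 x 10^241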

26
FORWARD SEARCH
• These methods 'grow' a set S of features:
1. S starts empty
2. Find the best feature to add (by checking which one gives the best performance on a validation set when combined with S)
3. If overall performance has improved, return to step 2; else stop
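
A minimal sketch of this greedy procedure, assuming a hypothetical helper evaluate(features) that trains the chosen classifier on those features and returns validation-set accuracy:

def forward_selection(all_features, evaluate):
    selected = []                        # S starts empty
    best_score = float("-inf")
    while True:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            return selected
        # score every candidate feature when added to the current set S
        scores = {f: evaluate(selected + [f]) for f in remaining}
        best_f = max(scores, key=scores.get)
        if scores[best_f] > best_score:  # overall performance improved
            best_score = scores[best_f]
            selected.append(best_f)
        else:                            # no improvement: stop
            return selected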

27
BACKWARD SEARCH
• These methods 'remove' features one by one:
1. S starts with the full feature set
2. Find the best feature to remove (by checking which removal from S gives the best performance on a validation set)
3. If overall performance has improved, return to step 2; else stop

28
MODELS FOR FEATURE SELECTION
Two models for Feature Selection
Filter methods
Carry out feature selection independently of any learning algorithm; the features are selected as a pre-processing step
Wrapper methods
Use the performance of a learning machine as a black box to score feature subsets

29
FILTER METHODS
A filter method does not make use of the classifier,
but rather attempts to find predictive subsets of the
features by making use of simple statistics
computed from the empirical distribution.

30
FILTER METHODS

31
FILTER METHODS
Ranking/Scoring of features
Select best individual features. A feature evaluation function is used to rank
individual features, then the highest ranked m features are selected.
Although these methods can exclude irrelevant features, they often include
redundant features.
Pearson correlation coefficient
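
A minimal sketch of correlation-based ranking with numpy (assumes a numeric target y; np.corrcoef returns the Pearson coefficient in the off-diagonal entry):

import numpy as np

def top_m_by_correlation(X, y, m):
    # X: (n_samples, n_features), y: numeric class labels or targets
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:m]   # indices of the m highest-ranked features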

32
FILTER METHODS
Minimum Redundancy Maximum Relevance (mRMR)

A good predictor set has:
maximum relevance – maximal power in discriminating between different classes
minimum redundancy – minimal correlation among features (members of the predictor set)
33
WRAPPER METHODS
Given a classifier C and a set of features F, a wrapper method searches the space of subsets of F, using cross-validation to compare the performance of the trained classifier C on each tested subset.

34
WRAPPER METHODS

35
WRAPPER METHODS
Say we have predictors A, B, C and classifier M. We want to find the smallest possible subset of {A, B, C} while achieving maximal performance.

Feature set    Classifier    Performance
{A,B,C}        M             98%
{A,B}          M             98%
{A,C}          M             77%
{B,C}          M             56%
{A}            M             89%
{B}            M             90%
{C}            M             91%
{ }            M             85%
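
A minimal sketch of the wrapper idea, assuming a hypothetical helper cv_score(subset) that returns the cross-validated accuracy of classifier M trained on that feature subset (exhaustive search, so only feasible for a handful of features):

from itertools import combinations

def best_subset(features, cv_score):
    scored = []
    for k in range(len(features) + 1):
        for subset in combinations(features, k):
            # higher accuracy wins; ties are broken in favour of smaller subsets
            scored.append((cv_score(subset), -k, subset))
    return max(scored)[2]

For the table above, best_subset(['A', 'B', 'C'], cv_score) would return ('A', 'B'): the same 98% as {A,B,C}, but with fewer features.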
36
“Genetic Algorithms are good at taking large, potentially huge search
spaces and navigating them, looking for optimal combinations of
things, solutions you might not otherwise find in a lifetime.”

- Salvatore Mangano
Computer Design, May 1995

Genetic Algorithms

37
FIRST – A BIOLOGY LESSON
A gene is a unit of heredity in a living organism
Genes are connected together into long strings called chromosomes
A gene represents a specific trait of the organism, like eye colour or
hair colour, and has several different settings.
For example, the settings for a hair colour gene may be blonde, black or brown
etc.

These genes and their settings are usually referred to as an


organism's genotype.
The physical expression of the genotype – the organism itself - is
called the phenotype.

38
FIRST – A BIOLOGY LESSON
Offspring inherit traits from their parents
An offspring may end up having half its genes from one parent and half from the other – recombination
Very occasionally a gene may be mutated – expressed in the organism as a completely new trait
For example: a child may have green eyes while neither parent does

39
GENETIC ALGORITHM IS
… a computer algorithm
that is based on the principles of genetics and evolution

40
GENETIC ALGORITHMS
Search algorithms based on the mechanics of biological evolution
Developed by John Holland, University of Michigan (1970’s)
Provide efficient, effective techniques for optimization and machine learning
applications
Widely-used today in business, scientific and engineering circles

41
GENETIC ALGORITHM
Genetic algorithm (GA) introduces the principles of evolution and genetics into the search among possible solutions to a given problem

The idea is to simulate the process in natural systems

This is done by creating, within a machine, a population of individuals

42
GENETIC ALGORITHM
Survival of the fittest
⚫ The main principle of evolution used in GA
is “survival of the fittest”.
⚫ Good solutions survive, while bad ones die.

43
APPLICATIONS: OPTIMIZATION
Assume an individual is going to give you z dollars,
after you tell them the value of x and y

x and y are in the range 0 to 10

44
(Surface plot of z as a function of x and y)

45
EXAMPLE PROBLEM I
(CONTINUOUS)

y = f(x)

Finding the maximum (minimum) of some function (within a defined range).
EXAMPLE PROBLEM II
(DISCRETE)

The Traveling Salesman Problem (TSP)


A salesman spends his time visiting n
cities. In one tour he visits each city just
once, and finishes up where he started. In
what order should he visit them to
minimize the distance traveled?
There are (n-1)!/2 possible tours.
GENETIC ALGORITHM
Inspired by natural evolution
Population of individuals
⚫ Individual is feasible solution to problem

Each individual is characterized by a Fitness function


⚫ Higher fitness is better solution

Based on their fitness, parents are selected to produce offspring


for a new generation
⚫ Fitter individuals have more chance to reproduce
⚫ New generation has same size as old generation; old generation dies

Offspring combine the properties of their two parents


If well designed, population will converge to optimal solution

48
GENETIC ALGORITHM

49
GENETIC ALGORITHM
Coding or Representation
Possible solutions to problem

Fitness function
Parent selection

Reproduction
Crossover
Mutation

Convergence
When to stop

50
CODING – EXAMPLE: FEATURE SELECTION
Assume we have 15 features, f1 to f15
Generate binary strings of 15 bits as the initial population
Population size = user-defined parameter

Initial population:
101110001110001
011101111100000
…………………………
111100011101101

Each bit is one gene; one row = one chromosome = one individual of the population
1 means the feature is used – 0 means the feature is not used
51
CODING: EXAMPLE
The Traveling Salesman Problem:

Find a tour of a given set of cities so that


Each city is visited only once
The total distance traveled is minimized
Representation is an ordered list of city numbers
1) London 3) Dunedin 5) Beijing 7) Tokyo
2) Venice 4) Singapore 6) Phoenix 8) Victoria

CityList1 (3 5 7 2 1 6 4 8)
CityList2 (2 5 7 6 8 1 3 4)
52
FITNESS FUNCTION/PARENT
SELECTION
Fitness function evaluates how good an individual is in solving the
problem

Fitness is computed for each individual

The fitness function is application dependent

For classification – we may use the classification rate as the fitness


function

Find the fitness value of each individual in the population


53
FITNESS FUNCTION/PARENT
SELECTION
Parent/Survivor Selection
Roulette Wheel Selection

Tournament Selection

Rank Selection

Elitist Selection

54
ROULETTE WHEEL SELECTION
Main idea: better individuals get higher chance
Individuals are assigned a probability of being selected based
on their fitness.
p_i = f_i / Σ_j f_j
where p_i is the probability that individual i will be selected,
f_i is the fitness of individual i, and
Σ_j f_j is the sum of the fitnesses of all individuals in the population.
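
A minimal sketch of one roulette-wheel spin (returns one selected individual):

import random

def roulette_select(population, fitnesses):
    total = sum(fitnesses)
    r = random.uniform(0, total)        # spin the wheel
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f
        if r <= cumulative:
            return individual
    return population[-1]               # guard against floating-point round-off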

55
ROULETTE WHEEL SELECTION
Assign to each individual a part of the roulette wheel
Spin the wheel n times to select n individuals

Example: fitness(A) = 3, fitness(B) = 1, fitness(C) = 2
A gets 3/6 = 50% of the wheel, B gets 1/6 ≈ 17%, C gets 2/6 ≈ 33%

56
TOURNAMENT SELECTION
Binary tournament
Two individuals are randomly chosen; the fitter of the two is selected as a
parent

Larger tournaments
n individuals are randomly chosen; the fittest one is selected as a parent

57
OTHER METHODS
Rank Selection
Each individual in the population is assigned a numerical rank based on
fitness, and selection is based on this ranking.

Elitism
Reserve k slots in the next generation for the highest-scoring/fittest chromosomes of the current generation

58
REPRODUCTION
Reproduction operators
Crossover
Mutation

Crossover is usually the primary operator with mutation serving


only as a mechanism to introduce diversity in the population

59
REPRODUCTION
Crossover
Two parents produce two offspring
There is a chance that the chromosomes of the two parents are copied
unmodified as offspring
There is a chance that the chromosomes of the two parents are randomly
recombined (crossover) to form offspring
Generally the chance of crossover is between 0.6 and 1.0

Mutation
There is a chance that a gene of a child is changed randomly
Generally the chance of mutation is low (e.g. 0.001)

60
CROSSOVER
Generating offspring from two selected parents
Single point crossover
Two point crossover (Multi point crossover)
Uniform crossover

61
ONE POINT CROSSOVER
Choose a random point on the two parents
Split parents at this crossover point
Create children by exchanging tails

Parent 1: XX|XXXXX
Parent 2: YY|YYYYY
Offspring 1: XXYYYYY
Offspring 2: YYXXXXX
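
A minimal sketch of one-point crossover on two equal-length strings or bit lists:

import random

def one_point_crossover(parent1, parent2):
    point = random.randint(1, len(parent1) - 1)   # random cut point
    child1 = parent1[:point] + parent2[point:]    # exchange tails
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

For example, one_point_crossover("XXXXXXX", "YYYYYYY") yields ("XXYYYYY", "YYXXXXX") when the cut point falls after position 2.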

63
CROSSOVER
Parents:
00101111000110
00110010001100
00101111001000
11101100000001

Crossover point chosen at random for each pair
64
CROSSOVER
Parents                 Offspring
00101111000110    →    00101100000001
00110010001100
00101111001000
11101100000001    →    11101111000110

65
CROSSOVER
Parents                 Offspring
00101111000110    →    00101100000001
00110010001100    →    00110111001000
00101111001000    →    00101010001100
11101100000001    →    11101111000110

66
TWO POINT CROSSOVER
Two-point crossover is very similar to single-point crossover except that two cut-points are generated instead of one.

Parent 1: XX|XXX|XX
Parent 2: YY|YYY|YY
Offspring 1: XXYYYXX
Offspring 2: YYXXXYY

67
N POINT CROSSOVER
Choose n random crossover points
Split along those points
Glue parts, alternating between parents

68
UNIFORM CROSSOVER
A random mask is generated
The mask determines which bits are copied from one parent
and which from the other parent
Bit density in mask determines how much material is taken from
the other parent

Mask: 0110011000 (Randomly generated)


Parents: 1010001110 0011010010

Offspring: 0011001010 1010010110

69
MUTATION
Alter each gene independently with a probability pm
pm is called the mutation rate
Typically between 1/pop_size and 1/chromosome_length

70
SUMMARY – REPRODUCTION CYCLE
Select parents for producing the next generation
For each consecutive pair, apply crossover with probability pc; otherwise copy the parents
For each offspring, apply mutation (bit-flip with probability pm)
Replace the population with the resulting population of offspring
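
A minimal sketch of one full generation, reusing the roulette_select and one_point_crossover sketches above; pc and pm here are typical (not prescribed) values, and individuals are lists of 0/1 genes:

import random

def next_generation(population, fitnesses, pc=0.9, pm=0.01):
    new_pop = []
    while len(new_pop) < len(population):
        p1 = roulette_select(population, fitnesses)
        p2 = roulette_select(population, fitnesses)
        if random.random() < pc:                  # crossover with probability pc
            c1, c2 = one_point_crossover(p1, p2)
        else:                                     # otherwise copy parents
            c1, c2 = p1[:], p2[:]
        for child in (c1, c2):
            # bit-flip mutation with probability pm per gene
            new_pop.append([1 - g if random.random() < pm else g for g in child])
    return new_pop[:len(population)]              # old generation is replaced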

71
CONVERGENCE
Stop Criterion
Number of generations
Fitness value
How fit is the fittest individual

72
GA FOR FEATURE SELECTION
The initial population is randomly generated
Each chromosome is evaluated using the fitness function
The fitness values of the current population are used to produce the offspring of the next generation
The generational process ends when the termination criterion is
satisfied
The selected features correspond to the best individual in the last
generation

73
GA FOR FEATURE SELECTION
GA can be executed multiple times
Example: 15 features, GA executed 10 times

74
GA FOR FEATURE SELECTION
Feature categories based on frequency of selection
Indispensable:
Feature selected in each selected feature subset.

Irrelevant:
Feature not selected in any of the selected subsets.

Partially Relevant:
Feature selected in some of the subsets.

75
GA WORKED EXAMPLE
Suppose that we have a rotary system (which could be
mechanical - like an internal combustion engine or gas turbine,
or electrical - like an induction motor).

The system has five parameters associated with it - a, b, c, d


and e. These parameters can take any integer value between 0
and 10.

When we adjust these parameters, the system responds by


speeding up or slowing down.

Our aim is to obtain the highest speed possible in revolutions


per minute from the system.
76
Step 1

GA WORKED EXAMPLE
Generate a population of random strings (we’ll use ten as an
example):

77
Step 2

GA WORKED EXAMPLE
Feed each of these strings into the machine, in turn, and
measure the speed in revolutions per minute of the machine. This
value is the fitness because the higher the speed, the better the
machine:

78
Step 3

GA WORKED EXAMPLE
To select the breeding population, we’ll go the easy route and
sort the strings then delete the worst ones. First sorting:

79
Step 3

GA WORKED EXAMPLE
Having sorted the strings into order we delete the worst half

80
Step 4

GA WORKED EXAMPLE
We can now crossover the strings by pairing them up randomly.
Since there’s an odd number, we’ll use the best string twice. The
pairs are shown below:

81
Step 4

GA WORKED EXAMPLE
The crossover points are selected randomly and are shown by the
vertical lines. After crossover the strings look like this:

These can now join their parents in the next generation
82
Step 4

GA WORKED EXAMPLE
New Generation

83
Step 4

GA WORKED EXAMPLE
We have one extra string (which we
picked up by using an odd number of
strings in the mating pool) that we can
delete after fitness testing.

84
Step 5

GA WORKED EXAMPLE
Finally, we have mutation, in which a small number of values are changed at random

85
GA WORKED EXAMPLE
After this, we repeat the algorithm from step 2, with this new population as the starting point.

Keep repeating until convergence

86
GA WORKED EXAMPLE
Roulette Wheel Selection
The alternative (roulette) method of selection would make up a breeding population by giving each of the old strings a chance of ending up in the breeding population that is proportional to its fitness

The cumulative fitness of each string is its own fitness added to the fitnesses of all the strings before it

87
GA WORKED EXAMPLE
Roulette Wheel Selection

1. If we now generate a random number between 0 and 10280


we can use this to select strings.
2. If the random number turns out to be between 0 and 100 then
we choose the last string.
3. If it’s between 8080 and 10280 we choose the first string.
4. If it’s between 2480 and 3480 we choose the string 3 6 8 6 9,
etc. You don’t have to sort the strings into order to use this
method.

88
Feature Extraction

89
FEATURE EXTRACTION
Unsupervised
Principal Component Analysis (PCA)
Independent Component Analysis (ICA)

Supervised
Linear Discriminant Analysis (LDA)

90
PRINCIPAL COMPONENT ANALYSIS
(PCA)
PCA is one of the most common feature extraction techniques
Reduce the dimensionality of a data set by finding a new set of
variables, smaller than the original set of variables
Allows us to combine much of the information contained in n
features into m features where m < n

91
PCA – INTRODUCTION

• The 1st PC is a minimum distance fit to a line in X space


• The 2nd PC is a minimum distance fit to a line in the plane
perpendicular to the 1st PC
PCs are a series of linear least squares fits to a sample,
each orthogonal to all the previous.
93
PRINCIPAL COMPONENT ANALYSIS
(PCA)

94
PCA – INTRODUCTION
Transform n-dimensional data to a new n-dimensions
The new dimension with the most variance is the first principal
component
The next is the second principal component, etc.

95
Recap

VARIANCE AND COVARIANCE


Variance is a measure of data spread in one dimension (feature)
Covariance measures how two dimensions (features) vary with
respect to each other

96
Recap

COVARIANCE
Focus on the sign (rather than exact value) of covariance
Positive value means that as one feature increases or decreases the other does
also (positively correlated)
Negative value means that as one feature increases the other decreases and
vice versa (negatively correlated)
A value close to zero means the features are uncorrelated (no linear relationship)

97
Recap

COVARIANCE MATRIX
Covariance matrix is an n × n matrix containing the covariance
values for all pairs of features in a data set with n features
(dimensions)

The diagonal contains the covariance of each feature with itself, which is its variance (the square of the standard deviation)

The matrix is symmetric
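
A minimal numpy sketch (rows are samples, columns are features; the data values here are just an assumed toy example):

import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9]])

S = np.cov(X, rowvar=False)   # 2 x 2 covariance matrix
# S[0, 0], S[1, 1] are the variances; S[0, 1] == S[1, 0], so S is symmetric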

98
PCA – MAIN STEPS
Center data around 0

Form the covariance matrix S.

Compute its eigenvectors:

The first p eigenvectors form the p PCs.

The transformation G consists of the p PCs.
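
A minimal numpy sketch of these steps (X has samples in rows; p is the number of components to keep):

import numpy as np

def pca(X, p):
    mean = X.mean(axis=0)
    Xc = X - mean                              # 1. center data around 0
    S = np.cov(Xc, rowvar=False)               # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # 3. eigenvectors (S is symmetric)
    order = np.argsort(eigvals)[::-1]          #    sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    G = eigvecs[:, :p]                         # 4. first p PCs form G
    Z = Xc @ G                                 # 5. project onto the p PCs
    return Z, G, mean, eigvals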

99
Data

PCA – WORKED EXAMPLE

100
Step 1

PCA – WORKED EXAMPLE


First step is to center the original data around 0
Subtract mean from each value

101
Step 2

PCA – WORKED EXAMPLE


Calculate the covariance matrix of the centered data – Only 2 ×
2 for this case

102
Step 3

PCA – WORKED EXAMPLE


Calculate the eigenvectors and eigenvalues of the covariance
matrix (remember linear algebra)
Covariance matrix – square n × n ; n eigenvalues will exist
All eigenvectors (principal components/dimensions) are orthogonal to each
other and will make a new set of dimensions for the data
The magnitude of each eigenvalue corresponds to the variance along that
new dimension – Just what we wanted!
We can sort the principal components according to their eigenvalues
Just keep those dimensions with largest eigenvalues

103
Step 3

PCA – WORKED EXAMPLE

Two eigenvectors
overlaying the centered
data

104
Step 4

PCA – WORKED EXAMPLE


Just keep the p eigenvectors with the largest eigenvalues
We do lose some information, but if we only drop dimensions with small eigenvalues, we lose very little
We can then use p input features rather than n
How many dimensions p should we keep?

(Scree plot: eigenvalue / proportion of variance versus component index 1, 2, …, n)
105
Step 4

PCA – WORKED EXAMPLE


Proportion of Variance (PoV):

PoV = (λ1 + λ2 + … + λp) / (λ1 + λ2 + … + λn)

when the λi are sorted in descending order

Typically, stop at PoV > 0.9

106
Step 4

PCA – WORKED EXAMPLE

107
Step 4

PCA – WORKED EXAMPLE


Transform the features to the p chosen eigenvectors
Take the p eigenvectors that you want to keep from the list of eigenvectors, and form a matrix with these eigenvectors in the columns.

For our example: either keep both vectors, or choose to leave out the smaller, less significant one

OR
108
Step 5

PCA – WORKED EXAMPLE


FinalData = RowFeatureVector x RowDataAdjust

RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top

RowDataAdjust is the mean-adjusted data, transposed

FinalData is the final data set, with data items in columns and dimensions along the rows

109
Step 5

PCA – WORKED EXAMPLE

110
Step 5

PCA – WORKED EXAMPLE

111
Step 5

PCA – WORKED EXAMPLE

112
PCA – WORKED EXAMPLE
Getting back the original data
We used the transformation: FinalData = RowFeatureVector x RowDataAdjust

This gives:
RowDataAdjust = RowFeatureVector^(-1) x FinalData

In our case, the inverse of the feature-vector matrix is equal to its transpose:
RowDataAdjust = RowFeatureVector^T x FinalData

113
PCA – WORKED EXAMPLE
Getting back the original data
Add the mean to get back the raw data:

OriginalData = (RowFeatureVector^T x FinalData) + Mean

If we use all (two in our case) eigenvectors, we get back exactly the original data
With one eigenvector, some information is lost (see the sketch below)
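
A minimal sketch of the reconstruction, reusing the names from the pca sketch earlier (G holds the kept eigenvectors in its columns, mean is the feature-wise mean):

def reconstruct(Z, G, mean):
    # G has orthonormal columns, so G.T acts as its inverse on the projected data
    return Z @ G.T + mean

With p = n (all eigenvectors) the reconstruction is exact; with p < n some information is lost.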

114
PCA – WORKED EXAMPLE

(Left: original data. Right: data restored using one eigenvector)
115
PCA APPLICATIONS
Eigenfaces for recognition. Turk and Pentland. 1991.

Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo.
2001.

Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.

116
ACKNOWLEDGEMENTS
Chapter 6, Introduction to Machine Learning, E. Alpaydin, MIT Press.

Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors.

117
