Dimensionality Reduction
WHY DIMENSIONALITY REDUCTION?
Generally, it is easy and convenient to collect data
An experiment
WHY DIMENSIONALITY REDUCTION?
Most machine learning/pattern recognition techniques may not be
effective for high-dimensional data
Curse of Dimensionality
Accuracy and efficiency may degrade rapidly as the dimension increases.
WHY DIMENSIONALITY REDUCTION?
Visualization: projection of high-dimensional data onto 2D or
3D.
DOCUMENT CLASSIFICATION
Documents (web pages, emails, papers from digital libraries such as ACM Portal, IEEE Xplore and PubMed) are represented by the frequency of each term they contain:

            T1    T2   ……   TN    Class
    D1      12     0   ……    6    Sport
    D2       3    10   ……   28    Travel
    …
    DM       0    11   ……   16    Jobs

■ Task: to classify unlabeled documents into categories
■ Challenge: thousands of terms
■ Solution: to apply dimensionality reduction
OTHER EXAMPLES
DIMENSIONALITY REDUCTION
Key methods of dimensionality reduction: feature selection and feature extraction
FEATURE SELECTION VS EXTRACTION
Feature selection: select a subset of the original features
Feature extraction: transform the original features into a new, smaller set of features
Feature Selection
CONTENTS: FEATURE SELECTION
Introduction
Wrappers
Genetic Algorithm
INTRODUCTION
You have some data, and you want to use it to build
a classifier, so that you can predict something (e.g.
likelihood of cancer)
FEATURE SELECTION: WHY?
Source: http://elpub.scix.net/data/works/att/02-28.content.pdf
FEATURE SELECTION: WHY?
Quite easy to find lots more cases from papers, where
experiments show that accuracy reduces when you use more
features
Questions?
Why does accuracy reduce with more features?
How does it depend on the specific choice of features?
What else changes if we use more features?
So, how do we choose the right features?
WHY DOES ACCURACY REDUCE?
Note: suppose the best feature set has 20 features. If you add another 5 features, the accuracy of the learned model typically drops, even though you still have the original 20 features. Why does this happen?
NOISE/EXPLOSION
The additional features typically add noise
FEATURE SUBSET SEARCH
(Scatter plots of the data projected onto pairs of features.)
x1 and x2 are important; x3 is not.
SUBSET SEARCH PROBLEM
An example of search space (Kohavi & John 1997)
The search can proceed in a forward direction (adding features) or a backward direction (removing features).
DIFFERENT ASPECTS OF SEARCH
Search starting points
⚫ Empty set
⚫ Full set
⚫ Random point
Search directions
⚫ Sequential forward selection
⚫ Sequential backward elimination
⚫ Bidirectional generation
⚫ Random generation
Search Strategies
⚫ Exhaustive/Complete
⚫ Heuristics
EXHAUSTIVE SEARCH
Original dataset has N features
You want to use a subset of k features
A complete FS method means: try every subset of k features, and
choose the best!
The number of subsets is N! / (k!(N−k)!)
What is this when N is 100 and k is 5?
75,287,520
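As a quick sanity check of that count, a minimal Python sketch using only the standard library:

from math import comb

# Number of k-feature subsets of N features: N! / (k! * (N - k)!)
N, k = 100, 5
print(comb(N, k))  # 75287520 -- already far too many subsets to try exhaustively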
FORWARD SEARCH
These methods 'grow' a set S of features:
1. S starts empty.
2. Find the best feature to add (by checking which one gives the best performance on a validation set when combined with S).
3. If overall performance has improved, add that feature to S and return to step 2; else stop.
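A minimal Python sketch of this greedy forward procedure, assuming a hypothetical helper evaluate(subset) that trains the classifier on the given features and returns validation accuracy:

def forward_selection(all_features, evaluate):
    # evaluate(subset) is an assumed helper: returns validation accuracy for that subset
    selected, best_score = [], float("-inf")
    while True:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            break
        # Score each candidate feature when added to the current subset
        scores = {f: evaluate(selected + [f]) for f in remaining}
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] > best_score:      # keep only improving additions
            selected.append(best_feature)
            best_score = scores[best_feature]
        else:
            break                                  # no improvement: stop
    return selected, best_score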
BACKWARD SEARCH
These methods 'remove' features one by one:
1. S starts with the full feature set.
2. Find the best feature to remove (by checking which removal from S gives the best performance on a validation set).
3. If overall performance has improved, remove that feature from S and return to step 2; else stop.
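The mirror-image sketch for backward elimination, under the same assumed evaluate(subset) helper:

def backward_elimination(all_features, evaluate):
    selected = list(all_features)
    best_score = evaluate(selected)
    while len(selected) > 1:
        # Score each candidate subset with one feature removed
        scores = {f: evaluate([g for g in selected if g != f]) for f in selected}
        worst_feature = max(scores, key=scores.get)
        if scores[worst_feature] > best_score:     # removing it helped
            selected.remove(worst_feature)
            best_score = scores[worst_feature]
        else:
            break
    return selected, best_score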
MODELS FOR FEATURE SELECTION
Two models for Feature Selection
Filter methods
Carry out feature selection independently of any learning algorithm; the features are selected as a pre-processing step
Wrapper methods
Use the performance of a learning machine as a black box to score feature subsets
FILTER METHODS
A filter method does not make use of the classifier,
but rather attempts to find predictive subsets of the
features by making use of simple statistics
computed from the empirical distribution.
FILTER METHODS
Ranking/Scoring of features
Select best individual features. A feature evaluation function is used to rank
individual features, then the highest ranked m features are selected.
Although these methods can exclude irrelevant features, they often include
redundant features.
The Pearson correlation coefficient is a commonly used evaluation function: r = cov(X, Y) / (σX σY)
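A minimal sketch of such a ranking filter, assuming the data are held in a numpy array X (samples by features) and a numeric label vector y; the names X, y and m are illustrative:

import numpy as np

def rank_by_correlation(X, y, m):
    # Score each feature by the absolute Pearson correlation with the label
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    ranked = np.argsort(scores)[::-1]      # best-scoring features first
    return ranked[:m]                      # indices of the m selected features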
FILTER METHODS
Minimum Redundancy Maximum Relevance (mRMR)
A good predictor set has maximum relevance (to the class) and minimum redundancy (among the selected features).
WRAPPER METHODS
Given a classifier C and a set of features F, a wrapper method searches the space of subsets of F, using cross-validation to compare the performance of the classifier C trained on each tested subset.
WRAPPER METHODS
Say we have predictors A, B, C and classifier M. We want to find the smallest possible subset of {A, B, C} while achieving maximal performance.

Genetic Algorithms
- Salvatore Mangano, Computer Design, May 1995
FIRST – A BIOLOGY LESSON
A gene is a unit of heredity in a living organism
Genes are connected together into long strings called chromosomes
A gene represents a specific trait of the organism, like eye colour or
hair colour, and has several different settings.
For example, the settings for a hair colour gene may be blonde, black, brown, etc.
FIRST – A BIOLOGY LESSON
Offspring inherit traits from their parents
An offspring may end up having half its genes from one parent and half from the other - recombination
Very occasionally a gene may be mutated - expressed in the organism as a completely new trait
For example: a child may have green eyes while neither parent does
GENETIC ALGORITHM IS
… a computer algorithm that is based on the principles of genetics and evolution
GENETIC ALGORITHMS
Search algorithms based on the mechanics of biological evolution
Developed by John Holland, University of Michigan (1970s)
Provide efficient, effective techniques for optimization and machine learning applications
Widely used today in business, scientific and engineering circles
GENETIC ALGORITHM
Genetic algorithm (GA) introduces the principles of evolution and genetics into the search among possible solutions to a given problem
GENETIC ALGORITHM
Survival of the fittest
⚫ The main principle of evolution used in GA is "survival of the fittest".
⚫ Good solutions survive, while bad ones die.
APPLICATIONS: OPTIMIZATION
Assume an individual is going to give you z dollars after you tell them the values of x and y
(Surface plot of z as a function of x and y.)
EXAMPLE PROBLEM I (CONTINUOUS)
(Plot of y = f(x).)
GENETIC ALGORITHM
Coding or Representation
Possible solutions to problem
Fitness function
Parent selection
Reproduction
Crossover
Mutation
Convergence
When to stop
CODING – EXAMPLE: FEATURE SELECTION
Assume we have 15 features f1 to f15
Generate binary strings of 15 bits as the initial population (bit i = 1 means feature fi is selected)
A different problem needs a different coding; e.g. for a tour of 8 cities a chromosome is a permutation of the cities:
CityList1 = (3 5 7 2 1 6 4 8)
CityList2 = (2 5 7 6 8 1 3 4)
FITNESS FUNCTION / PARENT SELECTION
Fitness function evaluates how good an individual is at solving the problem
Parent selection methods:
Roulette Wheel Selection
Tournament Selection
Rank Selection
Elitist Selection
ROULETTE WHEEL SELECTION
Main idea: better individuals get higher chance
Individuals are assigned a probability of being selected based
on their fitness.
pi = fi / Σj fj
where pi is the probability that individual i will be selected,
fi is the fitness of individual i, and
Σj fj is the sum of the fitnesses of all individuals in the population.
ROULETTE WHEEL SELECTION
Assign to each individual a part of the roulette wheel proportional to its fitness
Spin the wheel n times to select n individuals
Example: fitness(A) = 3, fitness(B) = 1, fitness(C) = 2
A gets 3/6 = 50% of the wheel, B gets 1/6 ≈ 17%, C gets 2/6 ≈ 33%
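A minimal Python sketch of roulette wheel selection, reusing the A/B/C fitness values above (random.choices(population, weights=fitnesses, k=n) would do the same job in one call):

import random

def roulette_select(population, fitnesses, n):
    total = sum(fitnesses)
    selected = []
    for _ in range(n):
        spin = random.uniform(0, total)            # where the "ball" lands
        cumulative = 0.0
        for individual, fit in zip(population, fitnesses):
            cumulative += fit
            if spin <= cumulative:
                selected.append(individual)
                break
    return selected

print(roulette_select(["A", "B", "C"], [3, 1, 2], n=4))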
TOURNAMENT SELECTION
Binary tournament
Two individuals are randomly chosen; the fitter of the two is selected as a
parent
Larger tournaments
n individuals are randomly chosen; the fittest one is selected as a parent
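A minimal sketch of tournament selection (k = 2 gives the binary tournament):

import random

def tournament_select(population, fitnesses, k=2):
    # Pick k random contestants and keep the fittest as a parent
    contestants = random.sample(range(len(population)), k)
    winner = max(contestants, key=lambda i: fitnesses[i])
    return population[winner]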
OTHER METHODS
Rank Selection
Each individual in the population is assigned a numerical rank based on
fitness, and selection is based on this ranking.
Elitism
Reserve k slots in the next generation for the highest scoring/fittest chromosomes of the current generation
REPRODUCTION
Reproduction operators
Crossover
Mutation
REPRODUCTION
Crossover
Two parents produce two offspring
There is a chance that the chromosomes of the two parents are copied
unmodified as offspring
There is a chance that the chromosomes of the two parents are randomly
recombined (crossover) to form offspring
Generally the chance of crossover is between 0.6 and 1.0
Mutation
There is a chance that a gene of a child is changed randomly
Generally the chance of mutation is low (e.g. 0.001)
CROSSOVER
Generating offspring from two selected parents
Single point crossover
Two point crossover (Multi point crossover)
Uniform crossover
ONE POINT CROSSOVER
Choose a random point on the two parents
Split parents at this crossover point
Create children by exchanging tails
Parent 1: XX|XXXXX
Parent 2: YY|YYYYY
Offspring 1: XXYYYYY
Offspring 2: YYXXXXX
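A minimal sketch of one-point crossover on two equal-length strings:

import random

def one_point_crossover(parent1, parent2):
    point = random.randint(1, len(parent1) - 1)   # cut somewhere inside the string
    child1 = parent1[:point] + parent2[point:]    # head of parent 1 + tail of parent 2
    child2 = parent2[:point] + parent1[point:]    # head of parent 2 + tail of parent 1
    return child1, child2

print(one_point_crossover("XXXXXXX", "YYYYYYY"))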
CROSSOVER
Four parent strings are paired up (1 with 4, 2 with 3) and crossed over at the point shown:
Parents                      Offspring
00101|111000110      ->      00101|100000001
11101|100000001      ->      11101|111000110
00110|010001100      ->      00110|111001000
00101|111001000      ->      00101|010001100
TWO POINT CROSSOVER
Two-point crossover is very similar to single-point crossover except that two cut-points are generated instead of one.
Parent 1: XX|XXX|XX
Parent 2: YY|YYY|YY
Offspring 1: XXYYYXX
Offspring 2: YYXXXYY
N POINT CROSSOVER
Choose n random crossover points
Split along those points
Glue parts, alternating between parents
UNIFORM CROSSOVER
A random mask is generated
The mask determines which bits are copied from one parent and which from the other parent
Bit density in the mask determines how much material is taken from the other parent
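A minimal sketch of uniform crossover with a random bit mask:

import random

def uniform_crossover(parent1, parent2):
    # True in the mask means: child 1 copies this position from parent 1
    mask = [random.random() < 0.5 for _ in parent1]
    child1 = "".join(p1 if m else p2 for m, p1, p2 in zip(mask, parent1, parent2))
    child2 = "".join(p2 if m else p1 for m, p1, p2 in zip(mask, parent1, parent2))
    return child1, child2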
MUTATION
Alter each gene independently with a probability pm
pm is called the mutation rate
Typically between 1/pop_size and 1/ chromosome_length
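A minimal sketch of bit-flip mutation on a binary-string chromosome:

import random

def mutate(chromosome, pm):
    # Flip each gene independently with probability pm (the mutation rate)
    return "".join(
        ("1" if gene == "0" else "0") if random.random() < pm else gene
        for gene in chromosome
    )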
SUMMARY – REPRODUCTION CYCLE
Select parents for producing the next generation
For each consecutive pair apply crossover with probability pc, otherwise copy the parents
For each offspring apply mutation (bit-flip with probability pm)
Replace the population with the resulting population of offspring
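A minimal sketch tying the cycle together, reusing the roulette_select, one_point_crossover and mutate sketches above; fitness() is whatever problem-specific function you supply, and the parameter values are illustrative:

import random

def genetic_algorithm(fitness, chrom_length=15, pop_size=20,
                      pc=0.8, pm=0.01, generations=50):
    # Initial population: random binary strings
    population = ["".join(random.choice("01") for _ in range(chrom_length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        parents = roulette_select(population, fitnesses, pop_size)
        next_gen = []
        for p1, p2 in zip(parents[0::2], parents[1::2]):
            if random.random() < pc:                  # crossover with probability pc
                c1, c2 = one_point_crossover(p1, p2)
            else:                                     # otherwise copy the parents
                c1, c2 = p1, p2
            next_gen += [mutate(c1, pm), mutate(c2, pm)]
        population = next_gen
    return max(population, key=fitness)               # fittest individual found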
CONVERGENCE
Stop Criterion
Number of generations
Fitness value
How fit is the fittest individual
GA FOR FEATURE SELECTION
The initial population is randomly generated
Each chromosome is evaluated using the fitness function
The fitness values of the current population are used to produce the offspring of the next generation
The generational process ends when the termination criterion is
satisfied
The selected features correspond to the best individual in the last
generation
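A minimal sketch of a wrapper-style fitness function for GA feature selection; the k-NN classifier, the 5-fold cross-validation and the arrays X, y are illustrative assumptions, not part of the slides:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def make_fitness(X, y):
    def fitness(chromosome):                     # e.g. "101001010011001"
        # Decode the binary chromosome into a feature mask
        mask = np.array([bit == "1" for bit in chromosome])
        if not mask.any():                       # empty subsets get zero fitness
            return 0.0
        clf = KNeighborsClassifier()             # illustrative classifier choice
        return cross_val_score(clf, X[:, mask], y, cv=5).mean()
    return fitness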
GA FOR FEATURE SELECTION
GA can be executed multiple times
Example: 15 features, GA executed 10 times
GA FOR FEATURE SELECTION
Feature categories based on frequency of selection
Indispensable:
Feature selected in each selected feature subset.
Irrelevant:
Feature not selected in any of the selected subsets.
Partially Relevant:
Feature selected in some of the subsets.
GA WORKED EXAMPLE
Suppose that we have a rotary system (which could be
mechanical - like an internal combustion engine or gas turbine,
or electrical - like an induction motor).
GA WORKED EXAMPLE
Generate a population of random strings (we’ll use ten as an
example):
GA WORKED EXAMPLE – Step 2
Feed each of these strings into the machine, in turn, and measure the speed of the machine in revolutions per minute. This value is the fitness, because the higher the speed, the better the machine.
GA WORKED EXAMPLE – Step 3
To select the breeding population, we'll go the easy route: sort the strings, then delete the worst ones. First, sorting:
GA WORKED EXAMPLE – Step 3 (continued)
Having sorted the strings into order, we delete the worst half.
GA WORKED EXAMPLE – Step 4
We can now cross over the strings by pairing them up randomly. Since there's an odd number, we'll use the best string twice. The pairs are shown below:
GA WORKED EXAMPLE – Step 4 (continued)
The crossover points are selected randomly and are shown by the vertical lines. After crossover, the strings look like this:
GA WORKED EXAMPLE
New Generation
GA WORKED EXAMPLE – Step 4 (continued)
We have one extra string (which we picked up by using an odd number of strings in the mating pool) that we can delete after fitness testing.
GA WORKED EXAMPLE – Step 5
Finally, we have mutation, in which a small number of values are changed at random.
GA WORKED EXAMPLE
After this, we repeat the algorithm from step 2, with this new population as the starting point.
GA WORKED EXAMPLE
Roulette Wheel Selection
The alternative (roulette) method of selection would make up a breeding population by giving each of the old strings a chance of ending up in the breeding population that is proportional to its fitness.
This is done by making the fitness value of each string the sum of its own fitness and the fitnesses of all strings before it (a cumulative total).
Feature Extraction
FEATURE EXTRACTION
Unsupervised
Principal Component Analysis (PCA)
Independent Component Analysis (ICA)
Supervised
Linear Discriminant Analysis (LDA)
PRINCIPAL COMPONENT ANALYSIS (PCA)
PCA is one of the most common feature extraction techniques
Reduce the dimensionality of a data set by finding a new set of
variables, smaller than the original set of variables
Allows us to combine much of the information contained in n
features into m features where m < n
PCA – INTRODUCTION
Transform n-dimensional data to a new n-dimensions
The new dimension with the most variance is the first principal
component
The next is the second principal component, etc.
RECAP: COVARIANCE
Focus on the sign (rather than the exact value) of covariance
Positive value means that as one feature increases or decreases the other does also (positively correlated)
Negative value means that as one feature increases the other decreases and vice versa (negatively correlated)
A value close to zero means the features are (linearly) uncorrelated
RECAP: COVARIANCE MATRIX
The covariance matrix is an n × n matrix containing the covariance values for all pairs of features in a data set with n features (dimensions)
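A minimal numpy sketch of computing a covariance matrix; the data values are illustrative:

import numpy as np

# One data item per row, one feature per column
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2]])

cov = np.cov(X, rowvar=False)   # 2 x 2 matrix of covariances between the features
print(cov)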
PCA – MAIN STEPS
1. Center the data around 0 (subtract the mean of each dimension)
2. Compute the covariance matrix
3. Compute the eigenvectors and eigenvalues of the covariance matrix
4. Choose the principal components (the eigenvectors with the largest eigenvalues)
5. Project the data onto the chosen components
Data: a small two-dimensional example data set.
Step 1: Subtract the mean of each dimension to center the data around 0.
Step 2: Compute the covariance matrix of the centered data.
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix.
Step 3 (continued): The two eigenvectors, overlaid on the centered data.
Step 4: Plot the eigenvalues (proportion of variance explained) against the component number 1, 2, …, n, and keep the components with the largest eigenvalues.
Step 4: For our example, either keep both eigenvectors or choose to leave out the smaller, less significant one.
Step 5: FinalData is the final data set, with data items in columns and dimensions along the rows.
PCA – WORKED EXAMPLE
Getting back the original data
We used the transformation FinalData = RowFeatureVector × RowDataAdjust
This gives RowDataAdjust = RowFeatureVector⁻¹ × FinalData
(Since the eigenvectors are orthonormal, the inverse of RowFeatureVector is simply its transpose.)
PCA – WORKED EXAMPLE
Getting back the original data (continued)
Finally, add the mean back to recover the raw data.
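A minimal numpy sketch of the whole pipeline described in Steps 1–5, including projection and reconstruction; the data values are illustrative:

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

mean = X.mean(axis=0)
centered = X - mean                                  # Step 1: center the data
cov = np.cov(centered, rowvar=False)                 # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)               # Step 3: eigenvectors/eigenvalues
order = np.argsort(eigvals)[::-1]                    # Step 4: sort by decreasing eigenvalue
components = eigvecs[:, order][:, :1]                # keep the first principal component
final_data = centered @ components                   # Step 5: project the data

# Getting back (an approximation of) the original data:
reconstructed = final_data @ components.T + mean
print(reconstructed)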
PCA – WORKED EXAMPLE
Yeung, K. Y. and Ruzzo, W. L. (2001). Principal Component Analysis for clustering gene expression data.
ACKNOWLEDGEMENTS
Chapter 6, Introduction to Machine Learning, E. Alpaydin, MIT Press.
Most slides in this presentation are adapted from the slides of the text book and various other sources. The copyright belongs to the original authors.