Linear Discriminant Analysis: A Detailed Tutorial: Ai Communications May 2017
Linear Discriminant Analysis: A Detailed Tutorial: Ai Communications May 2017
Linear Discriminant Analysis: A Detailed Tutorial: Ai Communications May 2017
net/publication/316994943
CITATIONS READS
33 32,629
4 authors:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Aboul Ella Hassanien on 02 July 2017.
F
Germany
12 b Faculty 63
of Engineering, Suez Canal University, Egypt
O
13 64
E-mail: [email protected]
14 c Faculty of Computers and Informatics, Suez Canal University, Egypt 65
15 66
O
E-mail: [email protected]
16 d Faculty of Engineering, Mansoura University, Egypt 67
17 68
E-mail: [email protected]
PR
18 e Faculty of Computers and Information, Cairo University, Egypt 69
19 70
E-mail: [email protected]
20 71
21 72
22
Abstract. Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a pre- 73
processing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box,
D
23 74
but (sometimes) not well understood. The aim of this paper is to build a solid intuition for what is LDA, and how LDA works,
24 75
thus enabling readers of all levels be able to get a better understanding of the LDA and to know how to apply this technique in
TE
25 76
different applications. The paper first gave the basic definitions and steps of how LDA technique works supported with visual
26 explanations of these steps. Moreover, the two methods of computing the LDA space, i.e. class-dependent and class-independent 77
27 methods, were explained in details. Then, in a step-by-step approach, two numerical examples are demonstrated to show how 78
28 the LDA space can be calculated in case of the class-dependent and class-independent methods. Furthermore, two of the most 79
EC
29 common LDA problems (i.e. Small Sample Size (SSS) and non-linearity problems) were highlighted and illustrated, and state- 80
30 of-the-art solutions to these problems were investigated and explained. Finally, a number of experiments was conducted with 81
31 different datasets to (1) investigate the effect of the eigenvectors that used in the LDA space on the robustness of the extracted 82
32
feature for the classification accuracy, and (2) to show when the SSS problem occurs and how it can be addressed. 83
R
33 Keywords: Dimensionality reduction, PCA, LDA, Kernel Functions, Class-Dependent LDA, Class-Independent LDA, SSS 84
34 (Small Sample Size) problem, eigenvectors artificial intelligence 85
R
35 86
36 87
1. Introduction are two major approaches of the dimensionality reduc-
O
37 88
38 tion techniques, namely, unsupervised and supervised 89
approaches. In the unsupervised approach, there is no
C
41 92
data mining [6,33], Bioinformatics [47], biometric [61]
42
and information retrieval [73]. The main goal of the di- niques take the class labels into consideration [15,32]. 93
There are many unsupervised dimensionality reduc-
U
43 94
mensionality reduction techniques is to reduce the di-
44
mensions by removing the redundant and dependent tion techniques such as Independent Component Anal- 95
45
features by transforming the features from a higher di- ysis (ICA) [28,31] and Non-negative Matrix Factor- 96
46
mensional space that may lead to a curse of dimension- ization (NMF) [14], but the most famous technique of 97
47
ality problem, to a space with lower dimensions. There the unsupervised approach is the Principal Component 98
48 Analysis (PCA) [4,62,67,71]. This type of data reduc- 99
49 * Scientific Research Group in Egypt, (SRGE), http://www. tion is suitable for many applications such as visualiza- 100
50 egyptscience.net. tion [2,40], and noise removal [70]. On the other hand, 101
51 ** Corresponding author. E-mail: [email protected]. the supervised approach has many techniques such as 102
0921-7126/17/$35.00 © 2017 – IOS Press and the authors. All rights reserved
[research-article] p. 2/22
1 Mixture Discriminant Analysis (MDA) [25] and Neu- the authors presented different applications that used 52
2 ral Networks (NN) [27], but the most famous technique the LDA-SSS techniques such as face recognition and 53
3 of this approach is the Linear Discriminant Analysis cancer classification. Furthermore, they conducted dif- 54
4 (LDA) [50]. This category of dimensionality reduction ferent experiments using three well-known face recog- 55
5 techniques are used in biometrics [12,36], Bioinfor- nition datasets to compare between different variants 56
6 matics [77], and chemistry [11]. of the LDA technique. Nonetheless, in [57], there is 57
7 The LDA technique is developed to transform the no detailed explanation of how (with numerical exam- 58
8 features into a lower dimensional space, which max- ples) to calculate the within and between class vari- 59
9 imizes the ratio of the between-class variance to the ances to construct the LDA space. In addition, the steps 60
10 within-class variance, thereby guaranteeing maximum of constructing the LDA space are not supported with 61
11 class separability [43,76]. There are two types of LDA well-explained graphs helping for well understanding 62
F
12 technique to deal with classes: class-dependent and of the LDA underlying mechanism. In addition, the 63
class-independent. In the class-dependent LDA, one non-linearity problem was not highlighted.
O
13 64
14 separate lower dimensional space is calculated for each This paper gives a detailed tutorial about the LDA 65
15 class to project its data on it whereas, in the class- technique, and it is divided into five sections. Section 2 66
O
16 independent LDA, each class will be considered as a gives an overview about the definition of the main 67
17 separate class against the other classes [1,74]. In this idea of the LDA and its background. This section be- 68
PR
18 type, there is just one lower dimensional space for all gins by explaining how to calculate, with visual expla- 69
19 classes to project their data on it. nations, the between-class variance, within-class vari- 70
20 Although the LDA technique is considered the most ance, and how to construct the LDA space. The algo- 71
21 well-used data reduction techniques, it suffers from a rithms of calculating the LDA space and projecting the 72
22 number of problems. In the first problem, LDA fails to data onto this space to reduce its dimension are then 73
introduced. Section 3 illustrates numerical examples to
D
23 find the lower dimensional space if the dimensions are 74
24 much higher than the number of samples in the data show how to calculate the LDA space and how to select 75
matrix. Thus, the within-class matrix becomes singu- the most robust eigenvectors to build the LDA space.
TE
25 76
26 lar, which is known as the small sample problem (SSS). While Section 4 explains the most two common prob- 77
27 There are different approaches that proposed to solve lems of the LDA technique and a number of state-of- 78
28 this problem. The first approach is to remove the null the-art methods to solve (or approximately solve) these 79
problems. Different applications that used LDA tech-
EC
41 ported in [1]. However, the authors did not show the 2.1. Definition of LDA 92
42 LDA algorithm in details using numerical tutorials, vi- 93
1 ance or within-class matrix. The third step is to con- 2.2. Calculating the between-class variance (SB ) 52
2 struct the lower dimensional space which maximizes 53
3 the between-class variance and minimizes the within- 54
The between-class variance of the ith class (SBi )
4 55
class variance. This section will explain these three represents the distance between the mean of the ith
5 56
steps in detail, and then the full description of the LDA class (μi ) and the total mean (μ). LDA technique
6 57
algorithm will be given. Figures 1 and 2 are used to searches for a lower-dimensional space, which is used
7 58
visualize the steps of the LDA technique. to maximize the between-class variance, or simply
8 59
9 60
10 61
11 62
F
12 63
O
13 64
14 65
15 66
O
16 67
17 68
PR
18 69
19 70
20 71
21 72
22 73
D
23 74
24 75
TE
25 76
26 77
27 78
28 79
EC
29 80
30 81
31 82
32 83
R
33 84
34 85
R
35 86
36 87
O
37 88
38 89
C
39 90
40 91
N
41 92
42 93
U
43 94
44 95
45 96
46 97
47 98
48 99
49 100
50 101
51 Fig. 1. Visualized steps to calculate a lower dimensional subspace of the LDA technique. 102
[research-article] p. 4/22
1 52
2 53
3 54
4 55
5 56
6 57
7 58
8 59
9 60
10 61
11 62
F
12 63
O
13 64
14 Fig. 2. Projection of the original samples (i.e. data matrix) on the lower dimensional space of LDA (Vk ). 65
15 66
O
16 maximize the separation distance between classes. class and the total mean in step (B) and (C), respec- 67
17 To explain how the between-class variance or the tively. 68
PR
18 between-class matrix (SB ) can be calculated, the fol- 69
19 lowing assumptions are made. Given the original data 1 70
μj = xi (2)
20 matrix X = {x1 , x2 , . . . , xN }, where xi represents the nj x ∈ω
i j
71
21 ith sample, pattern, or observation and N is the to- 72
1
22 N c 73
tal number of samples. Each sample is represented by ni
M features (xi ∈ RM ). In other words, each sam- μ= xi = μi (3)
D
23 74
N N
24 i=1 i=1 75
ple is represented as a point in M-dimensional space.
Assume the data matrix is partitioned into c = 3
TE
25 76
where c represents the total number of classes (in our
26
classes as follows, X = [ω1 , ω2 , ω3 ] as shown in example c = 3).
77
27 78
Fig. 1 (step (A)). Each class has five samples (i.e. The term (μi − μ)(μi − μ)T in Equation (1) repre-
28 79
n1 = n2 = n3 = 5), where ni represents the number of sents the separation distance between the mean of the
EC
29 80
samples of the ith class. The total number of samples ith class (μi ) and the total mean (μ), or simply it repre-
30 81
(N ) is calculated as follows, N = 3i=1 ni . sents the between-class variance of the ith class (SBi ).
31 82
To calculate the between-class variance (SB ), the Substitute SBi into Equation (1) as follows:
32 83
separation distance between different classes which
R
33 84
34
is denoted by (mi − m) will be calculated as fol- (mi − m)2 = W T SBi W (4) 85
lows:
R
35 86
36 2 The total between-class
variance is calculated as fol- 87
(mi − m)2 = W T μi − W T μ lows, (SB = ci=1 ni SBi ). Figure 1 (step (D)) shows
O
37 88
38 first how the between-class matrix of the first class 89
= W (μi − μ)(μi − μ) W
T T
(1) (SB1 ) is calculated and then how the total between-
C
39 90
40 class matrix (SB ) is then calculated by adding all the 91
where mi represents the projection of the mean of the between-class matrices of all classes.
N
41 92
42
ith class and it is calculated as follows, mi = W T μi , 93
where m is the projection of the total mean of all 2.3. Calculating the within-class variance (SW )
U
43 94
44 classes and it is calculated as follows, m = W T μ, W 95
45 represents the transformation matrix of LDA,1 μi (1 × The within-class variance of the ith class (SWi ) rep- 96
46 M) represents the mean of the ith class and it is com- resents the difference between the mean and the sam- 97
47 puted as in Equation (2), and μ(1 × M) is the to- ples of that class. LDA technique searches for a lower- 98
48 tal mean of all classes and it can be computed as in dimensional space, which is used to minimize the dif- 99
49 Equation (3) [36,83]. Figure 1 shows the mean of each ference between the projected mean (mi ) and the pro- 100
50 jected samples of each class (W T xi ), or simply min- 101
51 1 The transformation matrix (W ) will be explained in Section 2.4. imizes the within-class variance [36,83]. The within- 102
[research-article] p. 5/22
1 class variance of each class (SWj ) is calculated as in where λ represents the eigenvalues of the transfor- 52
2 Equation (5). mation matrix (W ). The solution of this problem 53
3 T 2 can be obtained by calculating the eigenvalues (λ = 54
4 W xi − mj {λ1 , λ2 , . . . , λM }) and eigenvectors (V = {v1 , v2 , . . . , 55
−1
5 xi ∈ωj ,j =1,...,c vM }) of W = SW SB , if SW is non-singular [36,81,83]. 56
6 2 The eigenvalues are scalar values, while the eigen- 57
7
= W T xij − W T μj vectors are non-zero vectors, which satisfies the Equa- 58
8 xi ∈ωj ,j =1,...,c tion (8) and provides us with the information about 59
9
the LDA space. The eigenvectors represent the direc- 60
= W T (xij − μj )2 W
10 tions of the new space, and the corresponding eigen- 61
xi ∈ωj ,j =1,...,c
11 values represent the scaling factor, length, or the mag- 62
F
12
= W (xij − μj )(xij − μj ) W
T T nitude of the eigenvectors [34,59]. Thus, each eigen- 63
vector represents one axis of the LDA space, and
O
13 64
xi ∈ωj ,j =1,...,c
14
the associated eigenvalue represents the robustness 65
15 = W T SWj W (5) of this eigenvector. The robustness of the eigenvec- 66
O
16 xi ∈ωj ,j =1,...,c tor reflects its ability to discriminate between differ- 67
17 ent classes, i.e. increase the between-class variance, 68
PR
18 From Equation (5), the within-class variance for and decreases the within-class variance of each class; 69
19 each class can be calculated as follows, SWj = djT ∗ hence meets the LDA goal. Thus, the eigenvectors with 70
nj the k highest eigenvalues are used to construct a lower
20 dj = i=1 (xij −μj )(xij −μj )T , where xij represents 71
21 the ith sample in the j th class as shown in Fig. 1 (step dimensional space (Vk ), while the other eigenvectors 72
22 (E), (F)), and dj is the centering data of the j th class, ({vk+1 , vk+2 , vM }) are neglected as shown in Fig. 1 73
nj (step (G)).
i.e. dj = ωj − μj = {xi }i=1 − μj . Moreover, step (F)
D
23 74
Figure 2 shows the lower dimensional space of the
24 in the figure illustrates how the within-class variance 75
LDA technique, which is calculated as in Fig. 1 (step
of the first class (SW1 ) in our example is calculated.
TE
25 76
26 The total within-class variance represents the sum of (G)). As shown, the dimension of the original data ma- 77
27 all within-class matrices of all classes (see Fig. 1 (step trix (X ∈ RN×M ) is reduced by projecting it onto the 78
28 (F))), and it can be calculated as in Equation (6). lower dimensional space of LDA (Vk ∈ RM×k ) as de- 79
noted in Equation (9) [81]. The dimension of the data
EC
29 80
30
3 after projection is k; hence, M − k features are ig- 81
SW = SWi nored or deleted from each sample. Thus, each sample
31 82
i=1 (xi ) which was represented as a point a M-dimensional
32
space will be represented in a k-dimensional space by
83
R
33 = (xi − μ1 )(xi − μ1 )T 84
projecting it onto the lower dimensional space (Vk ) as
34 xi ∈ω1 85
follows, yi = xi Vk .
R
35 86
36 + (xi − μ2 )(xi − μ2 )T 87
xi ∈ω2 Y = XVk (9)
O
37 88
38
+ (xi − μ3 )(xi − μ3 ) T
(6) Figure 3 shows a comparison between two lower- 89
C
39 90
xi ∈ω3 dimensional sub-spaces. In this figure, the original data
40 91
which consists of three classes as in our example are
2.4. Constructing the lower dimensional space
N
41 92
plotted. Each class has five samples, and all samples
42
are represented by two features only (xi ∈ R2 ) to 93
After calculating the between-class variance (SB )
U
43 94
be visualized. Thus, each sample is represented as a
44
and within-class variance (SW ), the transformation ma- 95
trix (W ) of the LDA technique can be calculated as in point in two-dimensional space. The transformation
45
Equation (7), which is called Fisher’s criterion. This matrix (W (2 × 2)) is calculated using the steps in 96
46 Section 2.2, 2.3, and 2.4. The eigenvalues (λ1 and λ2 ) 97
formula can be reformulated as in Equation (8).
47 and eigenvectors (i.e. sub-spaces) (V = {v1 , v2 }) of 98
48 W T SB W W are then calculated. Thus, there are two eigenvec- 99
49 arg max T (7) tors or sub-spaces. A comparison between the two 100
W W SW W
50 lower-dimensional sub-spaces shows the following no- 101
51 SW W = λSB W (8) tices: 102
[research-article] p. 6/22
F
12 63
O
13 64
14 65
15 In this section the detailed steps of the algorithms of 66
O
16 the two LDA methods are presented. As shown in Al- 67
17 gorithms 1 and 2, the first four steps in both algorithms 68
PR
18 are the same. Table 1 shows the notations which are 69
19
Fig. 3. A visualized comparison between the two lower-dimensional
used in the two algorithms. 70
20 71
sub-spaces which are calculated using three different classes.
21 2.7. Computational complexity of LDA 72
22
• First, the separation distance between different 73
In this section, the computational complexity for
D
23 74
classes when the data are projected on the first
24
eigenvector (v1 ) is much greater than when the LDA is analyzed. The computational complexity for 75
the first four steps, common in both class-dependent
TE
25 76
data are projected on the second eigenvector (v2 ).
26
As shown in the figure, the three classes are effi- and class-independent methods, are computed as fol- 77
27
ciently discriminated when the data are projected lows. As illustrated in Algorithm 1, in step (2), to cal- 78
28
on v1 . Moreover, the distance between the means culate the mean of the ith class, there are ni M addi- 79
tions and M divisions, i.e., in total, there are (N M +
EC
29
of the first and second classes (m1 −m2 ) when the 80
30
original data are projected on v1 is much greater cM) operations. In step (3), there are N M additions 81
31
than when the data are projected on v2 , which re- and M divisions, i.e., there are (NM + M) opera- 82
32
flects that the first eigenvector discriminates the tions. The computational complexity of the fourth step 83
R
33
three classes better than the second one. is c(M + M 2 + M 2 ), where M is for μi − μ, M 2 for 84
34
• Second, the within-class variance when the data (μi − μ)(μi − μ)T , and the last M 2 is for the multipli- 85
cation between ni and the matrix (μi −μ)(μi −μ)T . In
R
35 86
are projected on v1 is much smaller than when it
36 the fifth step, there are N (M + M 2 ) operations, where 87
projected on v2 . For example, SW1 when the data
M is for (xij −μj ) and M 2 is for (xij −μj )(xij −μj )T .
O
37 88
are projected on v1 is much smaller than when the
38 In the sixth step, there are M 3 operations to calculate 89
data are projected on v2 . Thus, projecting the data −1 −1
SW , M 3 is for the multiplication between SW and
C
39 90
on v1 minimizes the within-class variance much
40
better than v2 . SB , and M 3 to calculate the eigenvalues and eigenvec- 91
tors. Thus, in class-independent method, the computa-
N
41 92
42 From these two notes, we conclude that the first eigen- tional complexity is O(N M 2 ) if N > M; otherwise, 93
vector meets the goal of the lower-dimensional space the complexity is O(M 3 ).
U
43 94
44 of the LDA technique than the second eigenvector; In Algorithm 2, the number of operations to cal- 95
45 hence, it is selected to construct a lower-dimensional culate the within-class variance for each class SWj in 96
46 space. the sixth step is nj (M + M 2 ), and to calculate SW , 97
47 N (M + M 2 ) operations are needed. Hence, calculat- 98
48 2.5. Class-dependent vs. class-independent methods ing the within-class variance for both LDA methods 99
49 are the same. In the seventh step and eighth, there 100
50 The aim of the two methods of the LDA is to calcu- are M 3 operations for the inverse, M 3 for the multi- 101
−1
51 late the LDA space. In the class-dependent LDA, one plication of SW S , and M 3 for calculating eigenval-
i B
102
[research-article] p. 7/22
1 Algorithm 1 Linear Discriminant Analysis (LDA): Algorithm 2 Linear Discriminant Analysis (LDA): 52
2 Class-Independent Class-Dependent 53
3
1: Given a set of N samples [xi ]N
i=1 ,
each of which is 1: Given a set of N samples [xi ]N
i=1 ,
each of which is 54
4 55
represented as a row of length M as in Fig. 1 (step represented as a row of length M as in Fig. 1 (step
5
(A)), and X(N × M) is given by, (A)), and X(N × M) is given by, 56
6 57
7
⎡ ⎤ ⎡ ⎤ 58
x(1,1) x(1,2) ... x(1,M) x(1,1) x(1,2) ... x(1,M)
8 ⎢ x(2,1) x(2,2) ... x(2,M) ⎥ ⎢ x(2,1) x(2,2) ... x(2,M) ⎥ 59
⎢ ⎥ ⎢ ⎥
9 X=⎢ . .. .. .. ⎥ (10) X=⎢ . .. .. .. ⎥ (13) 60
10
⎣ .. . . . ⎦ ⎣ .. . . . ⎦ 61
11 x(N,1) x(N,2) ... x(N,M) x(N,1) x(N,2) ... x(N,M) 62
F
12 63
Compute the mean of each class μi (1 × M) as in Compute the mean of each class μi (1 × M) as in
O
13 64
2: 2:
14 65
Equation (2). Equation (2).
15
3: Compute the total mean of all data μ(1 × M) as in Compute the total mean of all data μ(1 × M) as in 66
O
3:
16 67
Equation (3). Equation (3).
17
4: Calculate between-class matrix SB (M ×M) as fol- 4: Calculate between-class matrix SB (M × M) as in 68
PR
18 69
lows: Equation (11)
19
5: for all Class i, i = 1, 2, . . . , c do 70
20
c
6: Compute within-class matrix of each class 71
SB = ni (μi − μ)(μi − μ)T (11)
21
SWi (M × M), as follows: 72
22 i=1 73
D
23 74
24 5: Compute within-class matrix SW (M × M), as fol- SWj = (xi − μj )(xi − μj )T (14) 75
xi ∈ωj
lows:
TE
25 76
26 77
nj
27
c
7: Construct a transformation matrix for each class 78
28
SW = (xij − μj )(xij − μj )T (12) (Wi ) as follows: 79
j =1 i=1
EC
29 80
−1
30 Wi = SW S
i B
(15) 81
31
where xij represents the ith sample in the j th 82
class.
32
6: From Equation (11) and (12), the matrix W that 8: The eigenvalues (λi )
and eigenvectors of (V i ) 83
R
33 84
maximizing Fisher’s formula which is defined in each transformation matrix (Wi ) are then calcu-
34
Equation (7) is calculated as follows, W = SW−1
SB . lated, where λi and V i represent the calculated 85
R
35 86
The eigenvalues (λ) and eigenvectors (V ) of W are eigenvalues and eigenvectors of the ith class, re-
36 87
then calculated. spectively.
O
37 88
7: Sorting eigenvectors in descending order accord- 9: Sorting the eigenvectors in descending order ac-
38 89
ing to their corresponding eigenvalues. The first k cording to their corresponding eigenvalues. The
C
39 90
eigenvectors are then used as a lower dimensional first k eigenvectors are then used to construct a
40
space (Vk ). lower dimensional space for each class Vki . 91
N
41 92
8: Project all original samples (X) onto the lower di- 10: Project the samples of each class (ωi ) onto their
42
mensional space of LDA as in Equation (9). lower dimensional space (Vki ), as follows: 93
U
43 94
44 j 95
j = xi Vk , xi ∈ ω j (16)
45 96
46 97
ues and eigenvectors. These two steps are repeated for where j represents the projected samples of
47 98
each class which increases the complexity of the class- the class ωj .
48 99
11: end for
49 dependent algorithm. Totally, the computational com- 100
50 plexity of the class-dependent algorithm is O(N M 2 ) if 101
51 N > M; otherwise, the complexity is O(cM 3 ). Hence, 102
[research-article] p. 8/22
Table 1
1 numbers are rounded up to the nearest hundredths 52
Notation
2 (i.e. only two digits after the decimal point are dis- 53
3 Notation Description played). 54
4 X Data matrix The first four steps of both class-independent and 55
5 N Total number of samples in X class-dependent methods are common as illustrated in 56
6 W Transformation matrix Algorithms 1 and 2. Thus, in this section, we show how 57
7 ni Number of samples in ωi these steps are calculated. 58
8 μi The mean of the ith class Given two different classes, ω1 (5 × 2) and ω2 (6 × 2) 59
9 μ Total or global mean of all samples have (n1 = 5) and (n2 = 6) samples, respectively. 60
10 SW i Within-class variance or scatter matrix of the ith class Each sample in both classes is represented by two fea- 61
11 (ωi ) tures (i.e. M = 2) as follows: 62
F
12 SBi Between-class variance of the ith class (ωi ) 63
V Eigenvectors of W
⎡ ⎤
1.00 2.00
O
13 64
14 Vi ith eigenvector ⎢2.00 3.00⎥ 65
⎢ ⎥
15 xij The ith sample in the j th class ω1 = ⎢
⎢3.00 3.00⎥⎥ and 66
O
16 k The dimension of the lower dimensional space (Vk ) ⎣4.00 5.00⎦ 67
17 xi ith sample 5.00 5.00 68
PR
18 M Dimension of X or the number of features of X ⎡ ⎤ (17)
69
19 Vk The lower dimensional space 4.00 2.00 70
⎢5.00 0.00⎥
20 c Total number of classes ⎢ ⎥ 71
⎢5.00 2.00⎥
21 mi The mean of the ith class after projection
ω2 = ⎢
⎢3.00
⎥ 72
22 m The total mean of all classes after projection ⎢ 2.00⎥⎥ 73
SW Within-class variance ⎣5.00 3.00⎦
D
23 74
24 SB Between-class variance 6.00 3.00 75
λ Eigenvalue matrix
TE
25 76
26 λi ith eigenvalue To calculate the lower dimensional space using 77
27 Y Projection of the original data LDA, first the mean of each class μj is calculated. The 78
28 ωi ith Class total mean μ(1×2) is then calculated, which represents 79
the mean of all means of all classes. The values of the
EC
29 80
30 the class-dependent method needs computations more mean of each class and the total mean are shown below, 81
31 than class-independent method.
82
32 In our case, we assumed that there are 40 classes μ1 = 3.00 3.60 , 83
R
33 and each class has ten samples. Each sample is repre- μ2 = 4.67 2.00 , and (18) 84
34 sented by 4096 features (M > N ). Thus, the compu- 5
85
μ = 11 6
μ2 = 3.91 2.727
R
39 90
40 3. Numerical examples the first class (SB1 ) is equal to, 91
N
41 92
42 In this section, two numerical examples will be SB1 = n1 (μ1 − μ)T (μ1 − μ) 93
43 94
= 5 −0.91 0.87 −0.91 0.87
44 steps to calculate the LDA space and how the LDA 95
45 technique is used to discriminate between only two 4.13 −3.97 96
= (19)
46 different classes. In the first example, the lower- −3.97 3.81 97
47 dimensional space is calculated using the class- 98
48 independent method, while in the second example, the Similarly, SB2 is calculated as follows: 99
49 class-dependent method is used. Moreover, a compar- 100
50 ison between the lower dimensional spaces of each 3.44 −3.31 101
SB2 = (20)
51 method is presented. In all numerical examples, the −3.31 3.17 102
[research-article] p. 9/22
1 The total between-class variance is calculated a fol- class matrix for each class and the total within-class 52
2 lows: matrix are as follows: 53
3 54
10.00 8.00
4
SB = SB1 + SB2 SW1 = ,
55
5
8.00 7.20 56
6 4.13 −3.97 57
= 5.33 1.00
7 −3.97 3.81 SW2 = , (23) 58
1.00 6.00
8 59
3.44 −3.31
9 + 15.33 9.00 60
−3.31 3.17 SW =
10
9.00 13.20 61
11 7.58 −7.27 62
=
F
(21)
12 −7.27 6.98 The transformation matrix (W ) in the class-independent 63
−1
method can be obtained as follows, W = SW SB , and
O
13 64
−1
14 To calculate the within-class matrix, first subtract the values of (SW ) and (W ) are as follows: 65
15 66
the mean of each class from each sample in that class
O
16
and this step is called mean-centering data and it is cal- −1 0.11 −0.07 67
SW = and
17
culated as follows, di = ωi − μi , where di represents −0.07 0.13 68
PR
18
centering data of the class ωi . The values of d1 and d2 (24) 69
1.36 −1.31
19
are as follows: W = 70
20 −1.48 1.42 71
21 ⎡ ⎤ 72
−2.00 −1.60 The eigenvalues (λ(2 × 2)) and eigenvectors (V (2 ×
22 ⎢−1.00 −0.60⎥ 73
⎢ ⎥ 2)) of W are then calculated as follows:
d1 = ⎢ ⎥
D
23 74
⎢ 0.00 −0.60⎥ and
24
⎣ 1.00 1.40 ⎦ 0.00 0.00
75
λ=
TE
25
2.00 1.40 and 76
0.00 2.78
26
⎡ ⎤ (22) (25)
77
27 −0.67 0.00 −0.69 0.68 78
⎢ 0.33 −2.00⎥ V =
28
⎢ ⎥ −0.72 −0.74 79
⎢ 0.33 0.00 ⎥
EC
d2 = ⎢ ⎥
29 80
⎢−1.67 0.00 ⎥
30
⎢ ⎥ From the above results it can be noticed that, the 81
31 ⎣ 0.33 1.00 ⎦ second eigenvector (V2 ) has corresponding eigenvalue 82
32 1.33 1.00 more than the first one (V1 ), which reflects that, the 83
R
35 are used to calculate the LDA space. sional space. The original data is projected on the 86
36 lower dimensional space, as follows, yi = ωi V2 , where 87
O
39 90
40 y1 = ω1 V2 91
⎡ ⎤
N
41 92
42
In this section, the LDA space is calculated us- 1.00 2.00 93
ing the class-independent method. This method rep- ⎢2.00 3.00⎥
⎢ ⎥ 0.68
U
43 94
resents the standard method of LDA as in Algo- ⎢
= ⎢3.00 3.00⎥⎥
⎣4.00 5.00⎦ −0.74
44 95
45 rithm 1. 96
46 After centring the data, the within-class variance 5.00 5.00 97
for each class (SWi (2 × 2)) is calculated as follows, ⎡ ⎤
47
nj −0.79 98
48 SWj = djT ∗ dj = i=1 (xij − μj )T (xij − μj ), where ⎢−0.85⎥ 99
⎢ ⎥
49 xij represents the ith sample in the j th class. The total =⎢⎢−0.18⎥
⎥ (26) 100
50 within-class matrix (SW (2 × 2)) is then calculated as ⎣−0.97⎦ 101
c
51 follows, SW = i=1 SWi . The values of the within- −0.29 102
[research-article] p. 10/22
F
12 follows: 63
O
13 64
−1
14 W 1 = SW 1
SB 65
15 −1 66
O
10.00 8.00 7.58 −7.27
16
= 67
17 8.00 7.20 −7.27 6.98 68
PR
18
0.90 −1.00 7.58 −7.27 69
=
19
−1.00 1.25 −7.27 6.98 70
20
71
21 14.09 −13.53 72
= (28)
22 −16.67 16.00 73
D
23 74
24 Similarly, W2 is calculated as follows: 75
Fig. 4. Probability density function of the projected data of the first
TE
25 76
example, (a) the projected data on V1 , (b) the projected data on V2 .
1.70 −1.63
26
W2 = (29) 77
27 Similarly, y2 is as follows: −1.50 1.44 78
28
⎡ ⎤ The eigenvalues (λi ) and eigenvectors (Vi ) for each
79
EC
29 1.24 80
⎢3.39⎥ transformation matrix (Wi ) are calculated, and the val-
30
⎢ ⎥ 81
⎢1.92⎥ ues of the eigenvalues and eigenvectors are shown be-
y2 = ω2 V2 = ⎢ ⎥
31 82
⎢0.56⎥ (27) low.
32
⎢ ⎥ 83
⎣1.18⎦
R
33 84
0.00 0.00
34 1.86 λω 1 = and 85
0.00 30.01
R
35
(30) 86
36 Figure 4 illustrates a probability density function −0.69 0.65 87
Vω1 =
(pdf) graph of the projected data (yi ) on the two eigen- −0.72 −0.76
O
37 88
38 vectors (V1 and V2 ). A comparison of the two eigen- 89
vectors reveals the following: 3.14 0.00
λω2 =
C
39 and 90
0.00 0.00
40 • The data of each class is completely discriminated (31) 91
N
43 94
44 the between-class variance more than the first where λωi and Vωi represent the eigenvalues and eigen- 95
45 one. vectors of the ith class, respectively. 96
46 • The within-class variance (i.e. the variance be- From the results shown (above) it can be seen that, 97
{2}
47 tween the same class samples) of the two classes the second eigenvector of the first class (Vω1 ) has cor- 98
48 are minimized when the data are projected on the responding eigenvalue more than the first one; thus, 99
49 second eigenvector. As shown in Fig. 4(b), the the second eigenvector is used as a lower dimensional 100
{2}
50 within-class variance of the first class is small space for the first class as follows, y1 = ω1 ∗ Vω1 , 101
51 compared with Fig. 4(a). where y1 represents the projection of the samples of 102
[research-article] p. 11/22
F
12 {2}
class-dependent method, the first class is projected on Vω1 , while
vector. Figure 6 illustrates that the within-class 63
{1} variance of the first class (SW1 ) was much
O
13 the second class is projected on Vω2 . 64
14 smaller when it was projected on V2 than V1 . 65
15 ∗ As a result of the above two findings, V2 is used 66
O
the first class. While, the first eigenvector in the second
16 {1} to construct the LDA space. 67
class (Vω2 ) has corresponding eigenvalue more than
17 68
{1}
• Class-Dependent: As shown from the figure, there
PR
18
the second one. Thus, Vω2 is used to project the data of 69
{1} {2} {1}
19 the second class as follows, y2 = ω2 ∗ Vω2 , where y2 are two eigenvectors, Vω1 (red line) and Vω2 70
20 represents the projection of the samples of the second (blue line), which represent the first and second 71
21 class. The values of y1 and y2 will be as follows: classes, respectively. The differences between the 72
22 ⎡ ⎤ two eigenvectors are as following: 73
⎡ ⎤ 1.68
D
23
−0.88 ⎢3.76⎥ ∗ Projecting the original data on the two eigen-
74
24 ⎢−1.00⎥ ⎢ ⎥ 75
⎢ ⎥ ⎢2.43⎥ vectors discriminates between the two classes.
y1 = ⎢ ⎥ ⎢ ⎥
TE
⎢−0.35⎥ and y2 = ⎢
25 76
⎥ (32)
26 ⎣−1.24⎦ ⎢0.93⎥ As shown in the figure, the distance between 77
⎣1.77⎦ the projected means m1 − m2 is larger than the
27
−0.59 78
28 2.53 distance between the original means μ1 − μ2 . 79
EC
{2}
33
• First, the projection data of the two classes are ∗ As a result of the above two findings, Vω1 and 84
34 {1} 85
efficiently discriminated. Vω2 are used to construct the LDA space.
R
35 86
• Second, the within-class variance of the projected
36
samples is lower than the within-class variance of • Class-Dependent vs. Class-Independent: The two 87
O
37 88
the original samples. LDA methods are used to calculate the LDA
38 89
space, but a class-dependent method calculates
C
39 90
separate lower dimensional spaces for each class
40 91
3.3. Discussion which has two main limitations: (1) it needs
N
41 92
more CPU time and calculations more than class-
42 93
In these two numerical examples, the LDA space is independent method; (2) it may lead to SSS prob-
U
43 94
calculated using class-dependent and class-independent lem because the number of samples in each class
44 95
methods. affects the singularity of SWi .2
45 96
Figure 6 shows a further explanation of the two
46 97
methods as following: These findings reveal that the standard LDA technique
47 98
used the class-independent method rather than using
48 • Class-Independent: As shown from the figure, 99
the class-dependent method.
49 there are two eigenvectors, V1 (dotted black line) 100
50 and V2 (solid black line). The differences between 101
51 the two eigenvectors are as follows: 2 SSS problem will be explained in Section 4.2. 102
[research-article] p. 12/22
1 52
2 53
3 54
4 55
5 56
6 57
7 58
8 59
9 60
10 61
11 62
F
12 63
O
13 64
14 65
15 66
O
16 67
17 68
PR
18 69
19 70
20 71
21 72
22 73
D
23 74
24 75
TE
25 76
26 77
27 78
28 79
Fig. 6. Illustration of the example of the two different methods of LDA methods. The blue and red lines represent the first and second eigenvectors
EC
29 80
of the class-dependent approach, respectively, while the solid and dotted black lines represent the second and first eigenvectors of class-indepen-
30 dent approach, respectively. 81
31 82
32 4. Main problems of LDA The mathematical interpretation for this problem is as 83
R
35 duction techniques, it suffers from two main problems: space cannot be calculated. 86
36 the Small Sample Size (SSS) and linearity problems. One of the solutions of this problem is based on the 87
O
37 In the next two subsections, these two problems will transformation concept, which is known as a kernel
88
38 be explained, and some of the state-of-the-art solutions methods or functions [3,50]. Figure 7 illustrates how
89
C
39 are highlighted. the transformation is used to map the original data into
90
40 91
a higher dimensional space; hence, the data will be lin-
N
1 52
2 53
3 54
4 55
5 56
6 57
7 58
8 59
9 60
10 61
11 62
F
12 63
O
13 64
14 65
15 66
O
16 67
17 68
PR
18 69
19 70
20 71
21 72
22
Fig. 8. Example of kernel functions, the samples lie on the top panel 73
(X) which are represented by a line (i.e. one-dimensional space) are
D
23 74
non-linearly separable, where the samples lie on the bottom panel
24 (Z) which are generated from mapping the samples of the top space 75
are linearly separable.
TE
25 76
26
Fig. 7. Two examples of two non-linearly separable classes, top φ i 77
27 where μi = n1i ni=1 φ{xi } and μφ = N1 × 78
panel shows how the two classes are non-separable, while the bot- N c ni φ
i=1 φ{xi } =
28 79
i=1 N μi .
tom shows how the transformation solves this problem and the two
EC
29 classes are linearly separable. Thus, in kernel LDA, all samples are transformed 80
30
non-linearly into a new space Z using the function φ. 81
31 82
transformation matrix (W ) in the new feature space In other words, the φ function is used to map the
32
original features into Z space by creating a non- 83
(Z) is calculated as in Equation (33).
R
33 84
linear combination of the original samples using a
34 85
dot-products of it [3]. There are many types of ker-
R
35
T φ nel functions to achieve this aim. Examples of these 86
W SB W
F (W ) = max
36 87
(33) function include Gaussian or Radial Basis Function
O
φ
37
W T SW W (RBF), K(xi , xj ) = exp(−xi − xj 2 /2σ 2 ), where 88
38 σ is a positive parameter, and the polynomial kernel 89
39 90
where W is a transformation matrix and Z is the new
40 φ 72]. 91
feature space. The between-class matrix (SB ) and the
N
41 92
φ
42
within-class matrix (SW ) are defined as follows:
4.2. Small sample size problem 93
U
43 94
44
c
4.2.1. Problem definition 95
φ φ φ φ T
SB = n i μi − μ φ
μi − μ (34)
45 Singularity, Small Sample Size (SSS), or under- 96
i=1
46 sampled problem is one of the big problems of LDA 97
nj
47
φ
c
φ
technique. This problem results from high-dimensional 98
48 SW = φ{xij } − μj pattern classification tasks or a low number of train- 99
49 j =1 i=1 ing samples available for each class compared with 100
50 φ T the dimensionality of the sample space [30,38,82, 101
51
× φ{xij } − μj (35) 85]. 102
[research-article] p. 14/22
1 The SSS problem occurs when the SW is singular.3 Four different variants of the LDA technique that are 52
2 The upper bound of the rank4 of SW is N − c, while used to solve the SSS problem are introduced as fol- 53
3 the dimension of SW is M × M [17,38]. Thus, in most lows: 54
4 cases M N − c which leads to SSS problem. For 55
PCA+LDA technique. In this technique, the orig-
5 example, in face recognition applications, the size of 56
6 the face image my reach to 100×100 = 10,000 pixels, inal d-dimensional features are first reduced to h- 57
7 which represent high-dimensional features and it leads dimensional feature space using PCA, and then the 58
8 to a singularity problem. LDA is used to further reduce the features to k- 59
9
dimensions. The PCA is used in this technique to re- 60
10
4.2.2. Common solutions to SSS problem: duce the dimensions to make the rank of SW is N − c 61
There are many studies that proposed many solu- as reported in [4]; hence, the SSS problem is addressed.
11 62
tions for this problem; each has its advantages and
F
12
However, the PCA neglects some discriminant infor- 63
drawbacks. mation, which may reduce the classification perfor-
O
13 64
14 • Regularization (RLDA): In regularization mance [57,60]. 65
15 method, the identity matrix is scaled by multi- Direct LDA technique. Direct LDA (DLDA) is one 66
O
16 plying it by a regularization parameter (η > 0) of the well-known techniques that are used to solve 67
17 and adding it to the within-class matrix to make the SSS problem. This technique has two main steps 68
PR
18 it non-singular [18,38,45,82]. Thus, the diagonal [83]. In the first step, the transformation matrix, W , is 69
19
components of the within-class matrix are biased computed to transform the training data to the range 70
20
as follows, SW = SW + ηI . However, choosing space of SB . In the second step, the dimensionality of 71
the value of the regularization parameter requires
21 the transformed data is further transformed using some 72
more tuning and a poor choice for this parame-
22 regulating matrices as in Algorithm 4. The benefit of 73
ter can degrade the performance of the method
D
23 the DLDA is that there is no discriminative features are 74
[38,45]. Another problem of this method is that
24 neglected as in PCA+LDA technique [83]. 75
the parameter η is just added to perform the in-
TE
25 76
26
verse of SW and has no clear mathematical inter- Regularized LDA technique. In the Regularized LDA 77
pretation [38,57]. (RLDA), a small perturbation is add to the SW matrix
27 78
• Sub-space: In this method, a non-singular inter- to make it non-singular as mentioned in [18]. This reg-
28 79
mediate space is obtained to reduce the dimen- ularization can be applied as follows:
EC
29 80
sion of the original data to be equal to the rank
30 81
31
of SW ; hence, SW becomes full-rank,5 and then (SW + ηI )−1 SB wi = λi wi (36) 82
SW can be inverted. For example, Belhumeur et
32 83
al. [4] used PCA, to reduce the dimensions of the where η represents a regularization parameter. The di-
R
33 84
original space to be equal to N − c (i.e. the upper agonal components of the SW are biased by adding this
34 85
bound of the rank of SW ). However, as reported small perturbation [13,18]. However, the regularization
R
35 86
in [22], losing some discriminant information is a parameter need to be tuned and poor choice of it can
36 87
common drawback associated with the use of this degrade the generalization performance [57].
O
37 88
method.
38
• Null Space: There are many studies proposed Null LDA technique. The aim of the NLDA technique 89
C
39
to remove the null space of SW to make SW is to find the orientation matrix W , and this can be 90
40
full-rank; hence, invertible. The drawback of this achieved using two steps. In the first step, the range 91
N
41
method is that more discriminant information is space of the SW is neglected, and the data are projected 92
42
lost when the null space of SW is removed, which only on the null space of SW as follows, SW W = 0. In 93
1 nique [54]. Mathematically, in the Null LDA (NLDA) or dimensions. Due to this high dimensionality, the 52
2 technique, the h column vectors of the transformation computational models need more time to train their 53
3 matrix W = [w1 , w2 , . . . , wh ] are taken to be the models, which may be infeasible and expensive. More- 54
4 null space of the SW as follows, wiT SW wi = 0, ∀i = over, this high dimensionality reduces the classifica- 55
5 1, . . . , h, where wiT SB wi = 0. Hence, M −(N −c) lin- tion performance of the computational model and in- 56
6 early independent vectors are used to form a new ori- creases its complexity. This problem can be solved us- 57
7 entation matrix, which is used to maximize |W T SB W | ing LDA technique to construct a new set of features 58
8 subject to the constraint |W T SW W | = 0 as in Equa- from a large number of original features. There are 59
9 many papers have been used LDA in medical applica- 60
tion (37).
10 tions [8,16,39,52–55]. 61
11 62
T
F
12 W = arg max W SB W (37) 6. Packages 63
|W T SW W |=0
O
13 64
14 In this section, some of the available packages that 65
15 are used to compute the space of LDA variants. For 66
O
16 5. Applications of the LDA technique example, WEKA6 is a well-known Java-based data 67
17 mining tool with open source machine learning soft- 68
PR
18 In many applications, due to the high number of fea- ware such as classification, association rules, regres- 69
19 tures or dimensionality, the LDA technique have been sion, pre-processing, clustering, and visualization. In 70
20 used. Some of the applications of the LDA technique WEKA, the machine learning algorithms can be ap- 71
21 and its variants are described as follows: plied directly on the dataset or called from person’s 72
22
Java code. XLSTAT7 is another data analysis and sta- 73
5.1. Biometrics applications tistical package for Microsoft Excel that has a wide
D
23 74
variety of dimensionality reduction algorithms includ-
24 75
Biometrics systems have two main steps, namely, ing LDA. dChip8 package is also used for visualiza-
TE
25 76
tion of gene expression and SNP microarray including
26 feature extraction (including pre-processing steps) and 77
some data analysis algorithms such as LDA, cluster-
27 recognition. In the first step, the features are extracted 78
ing, and PCA. LDA-SSS9 is a Matlab package, and it
28 from the collected data, e.g. face images, and in the 79
contains several algorithms related to the LDA tech-
EC
29 second step, the unknown samples, e.g. unknown face niques and its variants such as DLDA, PCA+LDA, 80
30 image, is identified/verified. The LDA technique and and NLDA. MASS10 package is based on R, and it has 81
31 its variants have been applied in this application. For functions that are used to perform linear and quadratic 82
32 example, in [10,20,41,68,75,83], the LDA technique discriminant function analysis. Dimensionality reduc- 83
R
33 have been applied on face recognition. Moreover, the tion11 package is mainly written in Matlab, and it has 84
34 LDA technique was used in Ear [84], fingerprint [44], a number of dimensionality reduction techniques such 85
R
35 gait [5], and speech [24] applications. In addition, the as ULDA, QLDA, and KDA. DTREG12 is a software 86
36 LDA technique was used with animal biometrics as in package that is used for medical data and modeling 87
O
39 90
40 91
In agriculture applications, an unknown sample can
N
41 92
7. Experimental results and discussion
42 be classified into a pre-defined species using computa- 93
tional models [64]. In this application, different vari- In this section, two experiments were conducted to
U
43 94
44 ants of the LDA technique was used to reduce the di- illustrate: (1) how the LDA is used for different appli- 95
45 mension of the collected features as in [9,21,26,46,63, 96
6 http:/www.cs.waikato.ac.nz/ml/weka/
46 64]. 97
7 http://www.xlstat.com/en/
47 98
8 https://sites.google.com/site/dchipsoft/home
48 5.3. Medical applications 9 http://www.staff.usp.ac.fj/sharma_al/index.htm 99
49 10 http://www.statmethods.net/advstats/discriminant.html 100
50 In medical applications, the data such as the DNA 11 http://www.public.asu.edu/*jye02/Software/index.html 101
51 microarray data consists of a large number of features 12 http://www.dtreg.com/index.htm 102
[research-article] p. 16/22
Table 2
1 cations, (2) what is the relation between its parame- 52
Dataset description
2 ter (Eigenvectors) and the accuracy of a classification 53
3 problem, (3) when the SSS problem could appear and Dataset Dimension (M) No. of Samples (N ) No. of classes (c) 54
4 a method for solving it. ORL64×64 4096 400 40 55
5 ORl32×32 1024 56
6 7.1. Experimental setup Ear64×64 4096 102 17 57
7 Ear32×32 1024 58
8
This section gives an overview of the databases, Yale64×64 4096 165 15 59
9
the platform, and the machine specification used to Yale32×32 1024 60
10 61
conduct our experiments. Different biometric datasets
11
were used in the experiments to show how the LDA the classification accuracy, the Nearest Neighbour clas- 62
F
12
using its parameter behaves with different data. These sifier was used. This classifier aims to classify the test- 63
O
13
datasets are described as follows: ing image by comparing its position in the LDA space 64
14 with the positions of training images. Furthermore, 65
15 • ORL dataset13 face images dataset (Olivetti Re- class-independent LDA was used in all experiments. 66
O
16 search Laboratory, Cambridge) [48], which con- Moreover, Matlab Platform (R2013b) and using a PC 67
17 sists of 40 distinct individuals, was used. In this with the following specifications: Intel(R) Core(TM) 68
PR
18 dataset, each individual has ten images taken at i5-2400 CPU @ 3.10 GHz and 4.00 GB RAM, under 69
19 different times and varying light conditions. The Windows 32-bit operating system were used in our ex- 70
20 size of each image is 92 × 112. periments. 71
21 • Yale dataset14 is another face images dataset 72
22 which contains 165 grey scale images in GIF for- 7.2. Experiment on LDA parameter (eigenvectors) 73
mat of 15 individuals [78]. Each individual has
D
23 74
24 11 images in different expressions and configura- The aim of this experiment is to investigate the re- 75
tion: center-light, happy, left-light, with glasses, lation between the number of eigenvectors used in the
TE
25 76
26 normal, right-light, sad, sleepy, surprised, and a LDA space, and the classification accuracy based on 77
27 wink. these eigenvectors and the required CPU time for this 78
28 • 2D ear dataset (Carreira-Perpinan,1995)15 images classification. 79
EC
29 dataset [7] was used. The ear data set consists of As explained earlier that the LDA space consists of 80
30 17 distinct individuals. Six views of the left pro- k eigenvectors, which are sorted according to their ro- 81
31 file from each subject were taken under a uniform, bustness (i.e. their eigenvalues). The robustness of each 82
32 diffuse lighting. eigenvector reflects its ability to discriminate between 83
R
33
In all experiments, k-fold cross-validation tests have different classes. Thus, in this experiment, it will be 84
34
used. In k-fold cross-validation, the original samples of checked whether increasing the number of eigenvec- 85
R
35
the dataset were randomly partitioned into k subsets of tors would increase the total robustness of the con- 86
36
(approximately) equal size and the experiment is run k structed LDA space; hence, different classes could be 87
O
37
times. For each time, one subset was used as the test- well discriminated. Also, it will be tested whether in- 88
38
ing set and the other k − 1 subsets were used as the creasing the number of eigenvectors would increase the 89
C
39
training set. The average of the k results from the folds dimension of the LDA space and the projected data; 90
40
can then be calculated to produce a single estimation. hence, CPU time increases. To investigate these is- 91
N
41
In this study, the value of k was set to 10. sue, three datasets listed in Table 2 (i.e. ORL32×32 , 92
42
The images in all datasets resized to be 64 × 64 and Ear32×32 , Yale32×32 ), were used. Moreover, seven, 93
43 94
32 × 32 as shown in Table 2. Figure 9 shows samples
44
of the used datasets and Table 2 shows a description of ear, Yale datasets, respectively, are used in this exper- 95
45
the datasets used in our experiments. iment. The results of this experiment are presented in 96
46
In all experiments, to show the effect of the LDA Fig. 10. 97
47
with its eigenvector parameter and its SSS problem on From Fig. 10 it can be noticed that the accuracy and 98
48 CPU time are proportional with the number of eigen- 99
49 13 http://www.cam-orl.co.uk vectors which are used to construct the LDA space. 100
50 14 http://vision.ucsd.edu/content/yale-face-database Thus, the choice of using LDA in a specific applica- 101
51 15 http://faculty.ucmerced.edu/mcarreira-perpinan/software.html tion should consider a trade-off between these factors. 102
[research-article] p. 17/22
1 52
2 53
3 54
4 55
5 56
6 57
7 58
8 59
9 60
10 61
11 62
F
12 63
O
13 64
14 65
15
Fig. 9. Samples of the first individual in: ORL face dataset (top row); Ear dataset (middle row), and Yale face dataset (bottom row). 66
O
16 67
17
Moreover, from Fig. 10(a), it can be remarked that 68
PR
18
when the number of eigenvectors used in computing 69
19
the LDA space was increased, the classification accu- 70
20 racy was also increased to a specific extent after which 71
21 the accuracy remains constant. As seen in Fig. 10(a), 72
22 this extent differs from application to another. For ex- 73
ample, the accuracy of the ear dataset remains con-
D
23 74
24 stant when the percentage of the used eigenvectors is 75
more than 10%. This was expected as the eigenvec-
TE
25 76
26 tors of the LDA space are sorted according to their ro- 77
27 bustness (see Section 2.6). Similarly, in ORL and Yale 78
28 datasets the accuracy became approximately constant 79
EC
35 86
vectors which confirms our findings. From these re-
36 87
sults, we can conclude that the high order eigenvectors
O
37 88
of the data of each application (the first 10% of ear
38 89
database and the first 40% of ORL and Yale datasets)
C
39 90
are robust enough to extract and save the most discrim-
40 91
inative features which are used to achieve a good accu-
N
41 92
racy.
42 93
These experiments confirmed that increasing the
U
43 94
44
number of eigenvectors will increase the dimension of 95
45
the LDA space; hence, CPU time increases. Conse- 96
46
quently, the amount of discriminative information and 97
47 the accuracy increases. 98
48 99
49 16 The weight of the eigenvector represents the ratio of its cor- 100
Fig. 10. Accuracy and CPU time of the LDA techniques using dif- responding eigenvalue (λi ) to the total of all eigenvalues (λi , i =
50 101
ferent percentages of eigenvectors, (a) Accuracy (b) CPU time. λ
51 1, 2, . . . , k) as follows, k i . 102
j =1 λj
[research-article] p. 18/22
1 Algorithm 3 PCA-LDA 52
2 1: Read the training images (X = {x1 , x2 , . . . , xN }), 53
3 where xi (ro×co) represents the ith training image, 54
4 ro and co represent the rows (height) and columns 55
5 (width) of xi , respectively, N represents the total 56
6 number of training images. 57
7 2: Convert all images in vector representation Z = 58
8 {z1 , z2 , . . . , zN }, where the dimension of Z is 59
9 M × 1, M = ro × co. 60
10 3: calculate the mean of each class μi , total mean of 61
11 all data μ, between-class matrix SB (M × M), and 62
F
12 within-class matrix SW (M × M) as in Algorithm 1 63
O
13 (Step (2–5)). 64
14 4: Use the PCA technique to reduce the dimension of 65
15 Fig. 11. The robustness of the first 40 eigenvectors of the LDA tech- X to be equal to or lower than r, where r represents 66
O
16 nique using ORL32×32 , Ear32×32 , and Yale32×32 datasets. the rank of SW , as follows: 67
17 68
PR
18 7.3. Experiments on the small sample size problem XPCA = U X T
(38) 69
19 70
20 The aim of this experiment is to show when the LDA where, U ∈ RM×r is the lower dimensional space 71
21 is subject to the SSS problem and what are the methods of the PCA and XPCA represents the projected data 72
22 that could be used to solve this problem. In this experi- on the PCA space. 73
33 sample of the ORL dataset is 64×64 is 4096 and the to- sional space (Vk ). 84
34 tal number of samples is 400. The mathematical inter- 8: The original samples (X) are first projected on the 85
R
35 pretation of this point shows that the dimension of SW PCA space as in Equation (38). The projection on 86
36 is M × M, while the upper bound of the rank of SW is the LDA space is then calculated as follows: 87
O
41 PCA is first used to reduce the dimension of the origi- LDA space. 92
42 nal data to make SW full-rank, and then standard LDA 93
43 94
44 gorithm 3. For more details of the PCA-LDA method and the accuracy. The results of these scenarios using 95
45 are reported in [4]. In the direct-LDA method, the null- both PCA-LDA and direct-LDA methods are summa- 96
46 space of SW matrix is removed to make SW full-rank, rized in Table 3. 97
47 then standard LDA space can be calculated as in Algo- As summarized in Table 3, the rank of SW is very 98
48 rithm 4. More details of direct-LDA methods are found small compared to the whole dimension of SW ; hence, 99
49 in [83]. the SSS problem occurs in all cases. As shown in 100
50 Table 2 illustrates various scenarios designed to test Table 3, using the PCA-LDA and the Direct-LDA, 101
51 the effect of different dimensions on the rank of SW the SSS problem can be solved and the Direct-LDA 102
[research-article] p. 19/22
1 Table 3 52
2 Accuracy of the PCA-LDA and direct-LDA methods using the datasets listed in Table 2 53
3 Dataset Dim(SW ) # Training images # Testing images Rank(SW ) Accuracy (%) 54
4 PCA-LDA Direct-LDA 55
5 ORL32×32 1024 × 1024 5 5 160 75.5 88.5 56
6 7 3 240 75.5 97.5 57
7 9 1 340 80.39 97.5 58
8 Ear32×32 1024 × 1024 3 3 34 80.39 96.08 59
9 4 2 51 88.24 94.12 60
10 5 1 68 100 100 61
11 Yale32×32 1024 × 1024 6 5 75 78.67 90.67 62
F
12 8 3 105 84.44 97.79 63
10 1 135 100 100
O
13 64
14 ORL64×64 4096 × 4096 5 5 160 72 87.5 65
15 7 3 240 81.67 96.67 66
O
16 9 1 340 82.5 97.5 67
17 Ear64×64 4096 × 4096 3 3 34 74.5 96.08 68
PR
18 4 2 51 91.18 96.08 69
19 5 1 68 100 100 70
20 Yale64×64 4096 × 4096 6 5 75 74.67 92 71
21 8 3 105 95.56 97.78 72
22 10 1 135 93.33 100 73
The bold values indicate that the corresponding methods obtain best performances.
Algorithm 4 Direct-LDA method
1: Read the training images (X = {x1, x2, . . . , xN}), where xi(ro×co) represents the ith training image, ro and co represent the rows (height) and columns (width) of xi, respectively, and N represents the total number of training images.
2: Convert all images into vector representation, so that each image is represented by a vector of dimension M × 1, where M = ro × co.
3: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB (M × M), and the within-class matrix SW (M × M) of X as in Algorithm 1 (Steps (2–5)).
4: Find the k eigenvectors of SB with non-zero eigenvalues and denote them by U = [u1, u2, . . . , uk] (i.e. U^T SB U > 0).
5: Calculate the eigenvalues and eigenvectors of U^T SW U, sort the eigenvalues, and discard the eigenvectors that correspond to the large (non-zero) eigenvalues; the selected eigenvectors are denoted by V; thus, V represents the null space of SW.
6: The final LDA matrix consists of the range17 of SB and the null space of SW, and is given by the product UV.
7: The original data are projected on the LDA space as follows: Y = XUV.

17 The range of a matrix A represents the column-space of A (C(A)), where the dimension of C(A) is equal to the rank of A.
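For illustration, a compact NumPy sketch of Algorithm 4 is given below. It is only a sketch of the direct-LDA idea of [83] under simplifying assumptions: dense eigendecompositions are used, the near-null space of SW is approximated by the c − 1 directions of smallest within-class scatter, and the function and variable names are chosen here for readability rather than taken from the paper:

    import numpy as np

    def direct_lda(X, y, tol=1e-10):
        # X: (N, M) training data, y: (N,) class labels.
        N, M = X.shape
        classes = np.unique(y)
        mu = X.mean(axis=0)
        S_B = np.zeros((M, M))
        S_W = np.zeros((M, M))
        for label in classes:
            Xc = X[y == label]
            d = (Xc.mean(axis=0) - mu)[:, None]
            S_B += Xc.shape[0] * (d @ d.T)       # between-class scatter (Step 3)
            D = Xc - Xc.mean(axis=0)
            S_W += D.T @ D                       # within-class scatter (Step 3)
        # Step 4: keep the eigenvectors of S_B with non-zero eigenvalues (range of S_B).
        evals_b, evecs_b = np.linalg.eigh(S_B)
        U = evecs_b[:, evals_b > tol * evals_b.max()]
        # Step 5: inside that subspace, keep the directions with the smallest
        # within-class scatter, which approximate the null space of S_W.
        evals_w, evecs_w = np.linalg.eigh(U.T @ S_W @ U)
        V = evecs_w[:, np.argsort(evals_w)[:len(classes) - 1]]
        # Steps 6-7: final projection matrix and projected data, Y = X U V.
        W = U @ V
        return X @ W, W

    # Example usage: Y, W = direct_lda(X, y) for X of shape (N, M) and integer labels y.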
In this paper, the definition, mathematics, and implementation of LDA were presented and explained. The paper aimed to give low-level details on how the LDA technique can address the dimensionality reduction problem by extracting discriminative features based on maximizing the ratio between the between-class variance, SB, and the within-class variance, SW, thus discriminating between the different classes. To achieve this aim, the paper followed the approach of not only explaining the steps of calculating SB and SW (i.e. the LDA space) but also visualizing these steps with figures and diagrams to make them easy to understand. Moreover, the two LDA methods, i.e. class-dependent and class-independent, were explained, and two numerical examples were given and graphically illustrated to explain how the LDA space can be constructed using the two methods. In all examples, the mathematical interpretation of the robustness and the selection of the eigenvectors, as well as the data projection, were detailed and discussed. Also, common LDA problems (e.g. the SSS and linearity problems) were mathematically explained using graphical examples, and their state-of-the-art solutions were highlighted. Moreover, a detailed implementation of LDA applications was presented. Using three standard datasets, a number of experiments were conducted to (1) investigate and explain the relation between the number of eigenvectors and the robustness of the LDA space, and (2) practically show when the SSS problem occurs and how it can be addressed.
References

[1] S. Balakrishnama and A. Ganapathiraju, Linear discriminant analysis – a brief tutorial, Institute for Signal and Information Processing (1998).
[2] E. Barshan, A. Ghodsi, Z. Azimifar and M.Z. Jahromi, Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition 44(7) (2011), 1357–1371. doi:10.1016/j.patcog.2010.12.015.
[3] G. Baudat and F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12(10) (2000), 2385–2404. doi:10.1162/089976600300014980.
[4] P.N. Belhumeur, J.P. Hespanha and D. Kriegman, Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997), 711–720. doi:10.1109/34.598228.
[5] N.V. Boulgouris and Z.X. Chi, Gait recognition using Radon transform and linear discriminant analysis, IEEE Transactions on Image Processing 16(3) (2007), 731–740.
[6] …, 2013.
[7] M. Carreira-Perpinan, Compression neural networks for feature extraction: Application to human recognition from ear images, MS thesis, Faculty of Informatics, Technical University of Madrid, Spain, 1995.
[8] H.-P. Chan, D. Wei, M.A. Helvie, B. Sahiner, D.D. Adler, M.M. Goodsitt and N. Petrick, Computer-aided classification of mammographic masses and normal tissue: Linear discriminant analysis in texture feature space, Physics in Medicine and Biology 40(5) (1995), 857–876. doi:10.1088/0031-9155/40/5/010.
[9] L. Chen, N.C. Carpita, W.-D. Reiter, R.H. Wilson, C. Jeffries and M.C. McCann, A rapid method to screen for cell-wall mutants using discriminant analysis of Fourier transform infrared spectra, The Plant Journal 16(3) (1998), 385–392. doi:10.1046/j.1365-313x.1998.00301.x.
[10] L.-F. Chen, H.-Y.M. Liao, M.-T. Ko, J.-C. Lin and G.-J. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition 33(10) (2000), 1713–1726. doi:10.1016/S0031-3203(99)00139-9.
[11] D. Coomans, D. Massart and L. Kaufman, Optimization by statistical linear discriminant analysis in analytical chemistry, Analytica Chimica Acta 112(2) (1979), 97–122. doi:10.1016/S0003-2670(01)83513-3.
[12] J. Cui and Y. Xu, Three dimensional palmprint recognition using linear discriminant analysis method, in: Proceedings of the Second International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA), 2011, IEEE, 2011, pp. 107–111. doi:10.1109/IBICA.2011.31.
[13] D.-Q. Dai and P.C. Yuen, Face recognition by regularized discriminant analysis, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37(4) (2007), 1080–1085. doi:10.1109/TSMCB.2007.895363.
[14] D. Donoho and V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts? in: Advances in Neural Information Processing Systems 16, MIT Press, 2004, pp. 1141–1148.
[15] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd edn, Wiley, 2012.
[16] S. Dudoit, J. Fridlyand and T.P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97(457) (2002), 77–87. doi:10.1198/016214502753479248.
[17] T.-t. Feng and G. Wu, A theoretical contribution to the fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices, 2014, arXiv preprint arXiv:1409.2579.
[18] J.H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84(405) (1989), 165–175. doi:10.1080/01621459.1989.10478752.
[19] T. Gaber, A. Tharwat, A.E. Hassanien and V. Snasel, Biometric cattle identification approach based on Weber's local descriptor and adaboost classifier, Computers and Electronics in Agriculture 122 (2016), 55–66. doi:10.1016/j.compag.2015.12.022.
[20] T. Gaber, A. Tharwat, A. Ibrahim, V. Snášel and A.E. Hassanien, Human thermal face recognition based on random linear oracle (RLO) ensembles, in: International Conference on Intelligent Networking and Collaborative Systems, IEEE, 2015.
[21] …-based feature extraction methods, in: 10th International Conference on Soft Computing Models in Industrial and Environmental Applications, Springer, 2015, pp. 375–385.
[22] H. Gao and J.W. Davis, Why direct LDA is not equivalent to LDA, Pattern Recognition 39(5) (2006), 1002–1006. doi:10.1016/j.patcog.2005.11.016.
[23] G.H. Golub and C.F. Van Loan, Matrix Computations, 4th edn, Vol. 3, Johns Hopkins University Press, 2012.
[24] R. Haeb-Umbach and H. Ney, Linear discriminant analysis for improved large vocabulary continuous speech recognition, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (1992), Vol. 1, IEEE, 1992, pp. 13–16.
[25] T. Hastie and R. Tibshirani, Discriminant analysis by Gaussian mixtures, Journal of the Royal Statistical Society, Series B (Methodological) (1996), 155–176.
[26] K. Héberger, E. Csomós and L. Simon-Sarkadi, Principal component and linear discriminant analyses of free amino acids and biogenic amines in Hungarian wines, Journal of Agricultural and Food Chemistry 51(27) (2003), 8055–8060. doi:10.1021/jf034851c.
[27] G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507. doi:10.1126/science.1127647.
[28] K. Honda and H. Ichihashi, Fuzzy local independent component analysis with external criteria and its application to knowledge discovery in databases, International Journal of Approximate Reasoning 42(3) (2006), 159–173. doi:10.1016/j.ijar.2005.10.011.
[29] Q. Hu, L. Zhang, D. Chen, W. Pedrycz and D. Yu, Gaussian kernel based fuzzy rough sets: Model, uncertainty measures and applications, International Journal of Approximate Reasoning 51(4) (2010), 453–471. doi:10.1016/j.ijar.2010.01.004.
[30] R. Huang, Q. Liu, H. Lu and S. Ma, Solving the small sample size problem of LDA, in: Proceedings of the 16th International Conference on Pattern Recognition, 2002, Vol. 3, IEEE, 2002, pp. 29–32.
[31] A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis, Vol. 46, Wiley, 2004.
[32] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, Wiley, 2000.
[33] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 1st edn, Wiley, 2014.
[34] T. Li, S. Zhu and M. Ogihara, Using discriminant analysis for multi-class classification: An experimental investigation, Knowledge and Information Systems 10(4) (2006), 453–472. doi:10.1007/s10115-006-0013-y.
[35] K. Liu, Y.-Q. Cheng, J.-Y. Yang and X. Liu, An efficient algorithm for Foley–Sammon optimal set of discriminant vectors by algebraic method, International Journal of Pattern Recognition and Artificial Intelligence 6(5) (1992), 817–829. doi:10.1142/S0218001492000412.
[36] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Face recognition using LDA-based algorithms, IEEE Transactions on Neural Networks 14(1) (2003), 195–200. doi:10.1109/TNN.2002.806647.
[37] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularized discriminant analysis for the small sample size problem in face recognition, Pattern Recognition Letters 24(16) (2003), 3079–3087. doi:10.1016/S0167-8655(03)00167-3.
[38] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition, Pattern Recognition Letters 26(2) (2005), 181–191. doi:10.1016/j.patrec.2004.09.014.
[39] B. Moghaddam, Y. Weiss and S. Avidan, Generalized spectral bounds for sparse LDA, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 641–648.
[40] W. Müller, T. Nocke and H. Schumann, Enhancing the visualization process with principal component analysis to support the exploration of trends, in: Proceedings of the 2006 Asia-Pacific Symposium on Information Visualisation, Vol. 60, Australian Computer Society, Inc., 2006, pp. 121–130.
[41] S. Noushath, G.H. Kumar and P. Shivakumara, (2D)2 LDA: An efficient approach for face recognition, Pattern Recognition 39(7) (2006), 1396–1400. doi:10.1016/j.patcog.2006.01.018.
[42] K.K. Paliwal and A. Sharma, Improved pseudoinverse linear discriminant analysis method for dimensionality reduction, International Journal of Pattern Recognition and Artificial Intelligence 26(1) (2012), 1250002. doi:10.1142/S0218001412500024.
[43] F. Pan, G. Song, X. Gan and Q. Gu, Consistent feature selection and its application to face recognition, Journal of Intelligent Information Systems 43(2) (2014), 307–321. doi:10.1007/s10844-014-0324-5.
[44] C.H. Park and H. Park, Fingerprint classification using fast Fourier transform and nonlinear discriminant analysis, Pattern Recognition 38(4) (2005), 495–503. doi:10.1016/j.patcog.2004.08.013.
[45] C.H. Park and H. Park, A comparison of generalized linear discriminant analysis algorithms, Pattern Recognition 41(3) (2008), 1083–1097. doi:10.1016/j.patcog.2007.07.022.
[46] S. Rezzi, D.E. Axelson, K. Héberger, F. Reniero, C. Mariani and C. Guillou, Classification of olive oils using high throughput flow 1H NMR fingerprinting with principal component analysis, linear discriminant analysis and probabilistic neural networks, Analytica Chimica Acta 552(1) (2005), 13–24. doi:10.1016/j.aca.2005.07.057.
[47] Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23(19) (2007), 2507–2517. doi:10.1093/bioinformatics/btm344.
[48] F.S. Samaria and A.C. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, IEEE, 1994, pp. 138–142.
[49] B. Schölkopf, C.J. Burges and A.J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.
[50] B. Schölkopf and K.-R. Müller, Fisher discriminant analysis with kernels, in: Proceedings of the 1999 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing IX, Madison, WI, USA, 1999, pp. 41–48.
[51] B. Schölkopf, A. Smola and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10(5) (1998), 1299–1319. doi:10.1162/089976698300017467.
[52] A. Sharma, S. Imoto and S. Miyano, A between-class overlapping filter-based method for transcriptome data analysis, Journal of Bioinformatics and Computational Biology 10(5) (2012), 1250010. doi:10.1142/S0219720012500102.
[53] A. Sharma, S. Imoto and S. Miyano, A filter based feature selection algorithm using null space of covariance matrix for DNA microarray gene expression data, Current Bioinformatics 7(3) (2012), 289–294. doi:10.2174/157489312802460802.
[54] A. Sharma, S. Imoto, S. Miyano and V. Sharma, Null space based feature selection method for gene expression data, International Journal of Machine Learning and Cybernetics 3(4) (2012), 269–276. doi:10.1007/s13042-011-0061-9.
[55] A. Sharma and K.K. Paliwal, Cancer classification by gradient LDA technique using microarray gene expression data, Data & Knowledge Engineering 66(2) (2008), 338–347. doi:10.1016/j.datak.2008.04.004.
[56] A. Sharma and K.K. Paliwal, A new perspective to null linear discriminant analysis method and its fast implementation using random matrix multiplication with scatter matrices, Pattern Recognition 45(6) (2012), 2205–2213. doi:10.1016/j.patcog.2011.11.018.
[57] A. Sharma and K.K. Paliwal, Linear discriminant analysis for the small sample size problem: An overview, International Journal of Machine Learning and Cybernetics (2014), 1–12.
[58] A.J. Smola and B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14(3) (2004), 199–222. doi:10.1023/B:STCO.0000035301.49549.88.
[59] G. Strang, Introduction to Linear Algebra, 4th edn, Wellesley-Cambridge Press, Massachusetts, 2003.
[60] D.L. Swets and J.J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8) (1996), 831–836. doi:10.1109/34.531802.
[61] M.M. Tantawi, K. Revett, A. Salem and M.F. Tolba, Fiducial feature reduction analysis for electrocardiogram (ECG) based biometric recognition, Journal of Intelligent Information Systems 40(1) (2013), 17–39. doi:10.1007/s10844-012-0214-7.
[62] A. Tharwat, Principal component analysis – a tutorial, International Journal of Applied Pattern Recognition 3(3) (2016), 197–240. doi:10.1504/IJAPR.2016.079733.
[63] A. Tharwat, T. Gaber, Y.M. Awad, N. Dey and A.E. Hassanien, Plants identification using feature fusion technique and bagging classifier, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 461–471.
[64] A. Tharwat, T. Gaber and A.E. Hassanien, One-dimensional vs. two-dimensional based features: Plant identification approach, Journal of Applied Logic (2016).
[65] A. Tharwat, T. Gaber, A.E. Hassanien, H.A. Hassanien and M.F. Tolba, Cattle identification using muzzle print images based on texture features approach, in: Proceedings of the Fifth International Conference on Innovations in Bio-Inspired Computing and Applications IBICA 2014, Springer, 2014, pp. 217–227.
[66] A. Tharwat, A.E. Hassanien and B.E. Elnaghi, A BA-based algorithm for parameter optimization of support vector machine, Pattern Recognition Letters (2016).
[67] A. Tharwat, A. Ibrahim, A.E. Hassanien and G. Schaefer, Ear recognition using block-based principal component analysis …
[68] … transformation and multi-classifier technique, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 183–193.
[69] A. Tharwat, Y.S. Moemen and A.E. Hassanien, Classification …
[70] C.G. Thomas, R.A. Harshman and R.S. Menon, Noise reduction in BOLD-based fMRI using component analysis, Neuroimage 17(3) (2002), 1521–1537. doi:10.1006/nimg.2002.1200.
[71] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3(1) (1991), 71–86.
[72] V. Vapnik, The Nature of Statistical Learning Theory, 2nd edn, Springer, New York, 2013.
[73] J. Venna, J. Peltonen, K. Nybo, H. Aidos and S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research 11 (2010), 451–490.
[74] P. Viszlay, M. Lojka and J. Juhár, Class-dependent two-dimensional linear discriminant analysis using two-pass recognition strategy, in: Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), IEEE, 2014, pp. 1796–1800.
[75] X. Wang and X. Tang, Random sampling LDA for face recognition, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, IEEE, 2004, pp. II–II.
[76] M. Welling, Fisher Linear Discriminant Analysis, Vol. 3, Department of Computer Science, University of Toronto, 2005.
[77] M.C. Wu, L. Zhang, Z. Wang, D.C. Christiani and X. Lin, Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection, Bioinformatics 25(9) (2009), 1145–1151. doi:10.1093/bioinformatics/btp019.
[78] J. Yang, D. Zhang, A.F. Frangi and J.-y. Yang, Two-dimensional PCA: A new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1) (2004), 131–137. doi:10.1109/TPAMI.2004.1261097.
[79] L. Yang, W. Gong, X. Gu, W. Li and Y. Liang, Null space discriminant locality preserving projections for face recognition, Neurocomputing 71(16) (2008), 3644–3649. doi:10.1016/j.neucom.2008.03.009.
[80] W. Yang and H. Wu, Regularized complete linear discriminant analysis, Neurocomputing 137 (2014), 185–191. doi:10.1016/…
[81] …
[82] J. Ye and T. Xiong, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, The Journal of Machine Learning Research 7 (2006), 1183–1204.
[83] H. Yu and J. Yang, A direct LDA algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34(10) (2001), 2067–2070. doi:10.1016/S0031-…
[84] …, in: Biometrics: Theory, Applications, and Systems, 2007, BTAS 2007, IEEE, 2007, pp. 1–5.
[85] X.-S. Zhuang and D.-Q. Dai, Inverse Fisher discriminant criteria for small sample size problem and its application to face recognition, Pattern Recognition 38(11) (2005), 2192–2194.