AI Communications 00 (20xx) 1–22
DOI 10.3233/AIC-170729
IOS Press

Linear discriminant analysis: A detailed tutorial

Alaa Tharwat a,b,*,**, Tarek Gaber c,*, Abdelhameed Ibrahim d,* and Aboul Ella Hassanien e,*

a Department of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt am Main, Germany
b Faculty of Engineering, Suez Canal University, Egypt
E-mail: [email protected]
c Faculty of Computers and Informatics, Suez Canal University, Egypt
E-mail: [email protected]
d Faculty of Engineering, Mansoura University, Egypt
E-mail: [email protected]
e Faculty of Computers and Information, Cairo University, Egypt
E-mail: [email protected]

* Scientific Research Group in Egypt (SRGE), http://www.egyptscience.net.
** Corresponding author. E-mail: [email protected].

Abstract. Linear Discriminant Analysis (LDA) is a very common technique for dimensionality reduction problems as a pre-processing step for machine learning and pattern classification applications. At the same time, it is usually used as a black box, but (sometimes) not well understood. The aim of this paper is to build a solid intuition for what LDA is and how it works, enabling readers of all levels to gain a better understanding of LDA and to know how to apply this technique in different applications. The paper first gives the basic definitions and steps of the LDA technique, supported with visual explanations of these steps. Moreover, the two methods of computing the LDA space, i.e. the class-dependent and class-independent methods, are explained in detail. Then, in a step-by-step approach, two numerical examples are demonstrated to show how the LDA space can be calculated with the class-dependent and the class-independent method. Furthermore, two of the most common LDA problems (i.e. the Small Sample Size (SSS) and non-linearity problems) are highlighted and illustrated, and state-of-the-art solutions to these problems are investigated and explained. Finally, a number of experiments are conducted with different datasets to (1) investigate the effect of the eigenvectors used to build the LDA space on the robustness of the extracted features and the classification accuracy, and (2) show when the SSS problem occurs and how it can be addressed.

Keywords: Dimensionality reduction, PCA, LDA, kernel functions, class-dependent LDA, class-independent LDA, SSS (Small Sample Size) problem, eigenvectors, artificial intelligence

1. Introduction

Dimensionality reduction techniques are important in many applications related to machine learning [15], data mining [6,33], Bioinformatics [47], biometrics [61] and information retrieval [73]. The main goal of dimensionality reduction techniques is to reduce the number of dimensions by removing redundant and dependent features, transforming the features from a higher-dimensional space, which may lead to the curse-of-dimensionality problem, to a space with lower dimensions. There are two major approaches to dimensionality reduction, namely, the unsupervised and the supervised approach. In the unsupervised approach, there is no need for labeling the classes of the data, while in the supervised approach the dimensionality reduction technique takes the class labels into consideration [15,32]. There are many unsupervised dimensionality reduction techniques such as Independent Component Analysis (ICA) [28,31] and Non-negative Matrix Factorization (NMF) [14], but the most famous technique of the unsupervised approach is Principal Component Analysis (PCA) [4,62,67,71]. This type of data reduction is suitable for many applications such as visualization [2,40] and noise removal [70]. On the other hand, the supervised approach has many techniques such as
Mixture Discriminant Analysis (MDA) [25] and Neural Networks (NN) [27], but the most famous technique of this approach is Linear Discriminant Analysis (LDA) [50]. This category of dimensionality reduction techniques is used in biometrics [12,36], Bioinformatics [77], and chemistry [11].

The LDA technique is developed to transform the features into a lower-dimensional space which maximizes the ratio of the between-class variance to the within-class variance, thereby guaranteeing maximum class separability [43,76]. There are two types of LDA technique to deal with classes: class-dependent and class-independent. In class-dependent LDA, one separate lower-dimensional space is calculated for each class to project its data onto, whereas in class-independent LDA, each class is considered as a separate class against the other classes [1,74] and there is just one lower-dimensional space onto which all classes project their data.

Although LDA is one of the most widely used data reduction techniques, it suffers from a number of problems. In the first problem, LDA fails to find the lower-dimensional space if the dimensions are much higher than the number of samples in the data matrix. Thus, the within-class matrix becomes singular, which is known as the Small Sample Size (SSS) problem. Different approaches have been proposed to solve this problem. The first approach is to remove the null space of the within-class matrix as reported in [56,79]. The second approach uses an intermediate subspace (e.g. PCA) to convert the within-class matrix to a full-rank matrix so that it can be inverted [4,35]. The third approach, a well-known solution, is to use a regularization method to solve singular linear systems [38,57]. In the second problem, the linearity problem, if different classes are non-linearly separable, LDA cannot discriminate between these classes. One solution to this problem is to use kernel functions as reported in [50].

Brief tutorials on the two LDA types are reported in [1]. However, the authors did not show the LDA algorithm in detail using numerical tutorials or visualized examples, nor did they give an in-depth investigation of experimental results. Moreover, in [57], an overview of the SSS problem for the LDA technique was presented, including its theoretical background, and different variants of the LDA technique that are used to solve the SSS problem were reviewed, such as Direct LDA (DLDA) [22,83], Regularized LDA (RLDA) [18,37,38,80], PCA+LDA [42], Null LDA (NLDA) [10,82], and kernel DLDA (KDLDA) [36]. In addition, the authors presented different applications that used the LDA-SSS techniques, such as face recognition and cancer classification. Furthermore, they conducted different experiments using three well-known face recognition datasets to compare different variants of the LDA technique. Nonetheless, in [57], there is no detailed explanation of how (with numerical examples) to calculate the within- and between-class variances to construct the LDA space. In addition, the steps of constructing the LDA space are not supported with well-explained graphs that help in understanding the underlying mechanism of LDA, and the non-linearity problem was not highlighted.

This paper gives a detailed tutorial about the LDA technique and is organized as follows. Section 2 gives an overview of the definition and the main idea of LDA and its background. This section begins by explaining how to calculate, with visual explanations, the between-class variance and the within-class variance, and how to construct the LDA space. The algorithms for calculating the LDA space and projecting the data onto this space to reduce its dimension are then introduced. Section 3 illustrates numerical examples to show how to calculate the LDA space and how to select the most robust eigenvectors to build it. Section 4 explains the two most common problems of the LDA technique and a number of state-of-the-art methods to solve (or approximately solve) these problems. Different applications that use the LDA technique are introduced in Section 5. In Section 6, different packages for LDA and its variants are presented. In Section 7, two experiments are conducted to show (1) the influence of the number of selected eigenvectors on the robustness and dimension of the LDA space, and (2) how the SSS problem occurs, highlighting the well-known methods to solve it. Finally, concluding remarks are given in Section 8.

2. LDA technique

2.1. Definition of LDA

The goal of the LDA technique is to project the original data matrix onto a lower-dimensional space. To achieve this goal, three steps need to be performed. The first step is to calculate the separability between the different classes (i.e. the distance between the means of different classes), which is called the between-class variance or between-class matrix. The second step is to calculate the distance between the mean and the samples of each class, which is called the within-class variance or within-class matrix.
The third step is to construct the lower-dimensional space which maximizes the between-class variance and minimizes the within-class variance. This section explains these three steps in detail, and then the full description of the LDA algorithm is given. Figures 1 and 2 are used to visualize the steps of the LDA technique.

2.2. Calculating the between-class variance ($S_B$)

The between-class variance of the $i$th class ($S_{B_i}$) represents the distance between the mean of the $i$th class ($\mu_i$) and the total mean ($\mu$). The LDA technique searches for a lower-dimensional space which is used to maximize the between-class variance, or simply maximize the separation distance between classes.

Fig. 1. Visualized steps to calculate a lower dimensional subspace of the LDA technique.
Fig. 2. Projection of the original samples (i.e. data matrix) on the lower dimensional space of LDA ($V_k$).

To explain how the between-class variance, or the between-class matrix ($S_B$), can be calculated, the following assumptions are made. Given the original data matrix $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ represents the $i$th sample, pattern, or observation and $N$ is the total number of samples, each sample is represented by $M$ features ($x_i \in R^M$). In other words, each sample is represented as a point in an $M$-dimensional space. Assume the data matrix is partitioned into $c = 3$ classes as follows, $X = [\omega_1, \omega_2, \omega_3]$, as shown in Fig. 1 (step (A)). Each class has five samples (i.e. $n_1 = n_2 = n_3 = 5$), where $n_i$ represents the number of samples of the $i$th class. The total number of samples ($N$) is calculated as follows, $N = \sum_{i=1}^{3} n_i$.

To calculate the between-class variance ($S_B$), the separation distance between the different classes, which is denoted by $(m_i - m)$, is calculated as follows:

$(m_i - m)^2 = (W^T \mu_i - W^T \mu)^2 = W^T (\mu_i - \mu)(\mu_i - \mu)^T W \quad (1)$

where $m_i$ represents the projection of the mean of the $i$th class and is calculated as $m_i = W^T \mu_i$, $m$ is the projection of the total mean of all classes and is calculated as $m = W^T \mu$, $W$ represents the transformation matrix of LDA,¹ $\mu_i$ ($1 \times M$) represents the mean of the $i$th class and is computed as in Equation (2), and $\mu$ ($1 \times M$) is the total mean of all classes and can be computed as in Equation (3) [36,83]. Figure 1 shows the mean of each class and the total mean in steps (B) and (C), respectively.

$\mu_j = \frac{1}{n_j} \sum_{x_i \in \omega_j} x_i \quad (2)$

$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i = \sum_{i=1}^{c} \frac{n_i}{N} \mu_i \quad (3)$

where $c$ represents the total number of classes (in our example $c = 3$).

¹ The transformation matrix ($W$) will be explained in Section 2.4.

The term $(\mu_i - \mu)(\mu_i - \mu)^T$ in Equation (1) represents the separation distance between the mean of the $i$th class ($\mu_i$) and the total mean ($\mu$), or simply it represents the between-class variance of the $i$th class ($S_{B_i}$). Substituting $S_{B_i}$ into Equation (1) gives:

$(m_i - m)^2 = W^T S_{B_i} W \quad (4)$

The total between-class variance is calculated as follows, $S_B = \sum_{i=1}^{c} n_i S_{B_i}$. Figure 1 (step (D)) shows first how the between-class matrix of the first class ($S_{B_1}$) is calculated and then how the total between-class matrix ($S_B$) is calculated by adding the between-class matrices of all classes.
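As a concrete illustration of Equations (2)-(4) and of the total between-class matrix $S_B = \sum_i n_i S_{B_i}$, the following minimal NumPy sketch computes the class means, the total mean, and $S_B$ for a labeled data matrix. The function and variable names are illustrative and are not taken from the paper.

```python
import numpy as np

def between_class_scatter(X, y):
    """Compute the total between-class matrix S_B (Eqs. (2)-(4)).

    X : (N, M) data matrix, one sample per row.
    y : (N,) class labels.
    """
    N, M = X.shape
    mu = X.mean(axis=0)                     # total mean (Eq. (3))
    S_B = np.zeros((M, M))
    for c in np.unique(y):
        X_c = X[y == c]                     # samples of class c
        n_c = X_c.shape[0]
        mu_c = X_c.mean(axis=0)             # class mean (Eq. (2))
        d = (mu_c - mu).reshape(-1, 1)      # (mu_i - mu) as a column vector
        S_B += n_c * (d @ d.T)              # n_i (mu_i - mu)(mu_i - mu)^T
    return S_B
```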
2.3. Calculating the within-class variance ($S_W$)

The within-class variance of the $i$th class ($S_{W_i}$) represents the difference between the mean and the samples of that class. The LDA technique searches for a lower-dimensional space which is used to minimize the difference between the projected mean ($m_i$) and the projected samples of each class ($W^T x_i$), or simply minimizes the within-class variance [36,83]. The within-class variance of each class ($S_{W_j}$) is calculated as in Equation (5).

$\sum_{x_i \in \omega_j,\, j=1,\ldots,c} (W^T x_i - m_j)^2 = \sum_{x_i \in \omega_j,\, j=1,\ldots,c} (W^T x_{ij} - W^T \mu_j)^2 = \sum_{x_i \in \omega_j,\, j=1,\ldots,c} W^T (x_{ij} - \mu_j)(x_{ij} - \mu_j)^T W = \sum_{j=1,\ldots,c} W^T S_{W_j} W \quad (5)$

From Equation (5), the within-class variance of each class can be calculated as follows, $S_{W_j} = d_j^T d_j = \sum_{i=1}^{n_j} (x_{ij} - \mu_j)(x_{ij} - \mu_j)^T$, where $x_{ij}$ represents the $i$th sample in the $j$th class, as shown in Fig. 1 (steps (E), (F)), and $d_j$ is the centered data of the $j$th class, i.e. $d_j = \omega_j - \mu_j = \{x_i\}_{i=1}^{n_j} - \mu_j$. Moreover, step (F) in the figure illustrates how the within-class variance of the first class ($S_{W_1}$) in our example is calculated. The total within-class variance is the sum of the within-class matrices of all classes (see Fig. 1 (step (F))), and it can be calculated as in Equation (6).

$S_W = \sum_{i=1}^{3} S_{W_i} = \sum_{x_i \in \omega_1} (x_i - \mu_1)(x_i - \mu_1)^T + \sum_{x_i \in \omega_2} (x_i - \mu_2)(x_i - \mu_2)^T + \sum_{x_i \in \omega_3} (x_i - \mu_3)(x_i - \mu_3)^T \quad (6)$

2.4. Constructing the lower dimensional space

After calculating the between-class variance ($S_B$) and the within-class variance ($S_W$), the transformation matrix ($W$) of the LDA technique can be calculated as in Equation (7), which is called Fisher's criterion. This formula can be reformulated as in Equation (8).

$\arg\max_W \frac{W^T S_B W}{W^T S_W W} \quad (7)$

$S_B W = \lambda S_W W \quad (8)$

where $\lambda$ represents the eigenvalues of the transformation matrix ($W$). The solution of this problem can be obtained by calculating the eigenvalues ($\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_M\}$) and eigenvectors ($V = \{v_1, v_2, \ldots, v_M\}$) of $W = S_W^{-1} S_B$, if $S_W$ is non-singular [36,81,83]. The eigenvalues are scalar values, while the eigenvectors are non-zero vectors which satisfy Equation (8) and provide us with information about the LDA space. The eigenvectors represent the directions of the new space, and the corresponding eigenvalues represent the scaling factor, length, or magnitude of the eigenvectors [34,59]. Thus, each eigenvector represents one axis of the LDA space, and the associated eigenvalue represents the robustness of this eigenvector. The robustness of an eigenvector reflects its ability to discriminate between different classes, i.e. to increase the between-class variance and decrease the within-class variance of each class, and hence to meet the LDA goal. Thus, the eigenvectors with the $k$ highest eigenvalues are used to construct the lower dimensional space ($V_k$), while the other eigenvectors ($\{v_{k+1}, v_{k+2}, \ldots, v_M\}$) are neglected, as shown in Fig. 1 (step (G)).

Figure 2 shows the lower dimensional space of the LDA technique, which is calculated as in Fig. 1 (step (G)). As shown, the dimension of the original data matrix ($X \in R^{N \times M}$) is reduced by projecting it onto the lower dimensional space of LDA ($V_k \in R^{M \times k}$), as denoted in Equation (9) [81]. The dimension of the data after projection is $k$; hence, $M - k$ features are ignored or deleted from each sample. Thus, each sample ($x_i$), which was represented as a point in an $M$-dimensional space, will be represented in a $k$-dimensional space by projecting it onto the lower dimensional space ($V_k$) as follows, $y_i = x_i V_k$.

$Y = X V_k \quad (9)$
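Sections 2.3 and 2.4 translate directly into code. The sketch below (illustrative names; it reuses the `between_class_scatter` helper from the sketch in Section 2.2) computes $S_W$ as in Equation (6), solves the eigenproblem of $S_W^{-1} S_B$ corresponding to Equation (8), sorts the eigenvectors by their eigenvalues, and projects the data as in Equation (9). It assumes $S_W$ is invertible; when it is not, the SSS problem of Section 4.2 applies.

```python
import numpy as np

def within_class_scatter(X, y):
    """Total within-class matrix S_W (Eqs. (5)-(6))."""
    N, M = X.shape
    S_W = np.zeros((M, M))
    for c in np.unique(y):
        d = X[y == c] - X[y == c].mean(axis=0)   # centered class data d_j
        S_W += d.T @ d                           # sum of (x - mu_j)(x - mu_j)^T
    return S_W

def lda_projection(X, y, k):
    """Class-independent LDA: return the projection matrix V_k and Y = X V_k."""
    S_B = between_class_scatter(X, y)
    S_W = within_class_scatter(X, y)
    # Eigenvectors of S_W^{-1} S_B (Eq. (8)); assumes S_W is non-singular.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]       # sort by decreasing eigenvalue
    V_k = eigvecs[:, order[:k]].real             # keep the k most robust directions
    return V_k, X @ V_k                          # projection as in Eq. (9)
```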
Fig. 3. A visualized comparison between the two lower-dimensional sub-spaces which are calculated using three different classes.

Figure 3 shows a comparison between two lower-dimensional sub-spaces. In this figure, the original data, which consists of three classes as in our example, is plotted. Each class has five samples, and all samples are represented by two features only ($x_i \in R^2$) so that they can be visualized. Thus, each sample is represented as a point in a two-dimensional space. The transformation matrix ($W$ ($2 \times 2$)) is calculated using the steps in Sections 2.2, 2.3, and 2.4. The eigenvalues ($\lambda_1$ and $\lambda_2$) and eigenvectors (i.e. sub-spaces) ($V = \{v_1, v_2\}$) of $W$ are then calculated. Thus, there are two eigenvectors or sub-spaces. A comparison between the two lower-dimensional sub-spaces shows the following:

• First, the separation distance between the different classes when the data are projected on the first eigenvector ($v_1$) is much greater than when the data are projected on the second eigenvector ($v_2$). As shown in the figure, the three classes are efficiently discriminated when the data are projected on $v_1$. Moreover, the distance between the means of the first and second classes ($m_1 - m_2$) when the original data are projected on $v_1$ is much greater than when the data are projected on $v_2$, which reflects that the first eigenvector discriminates the three classes better than the second one.
• Second, the within-class variance when the data are projected on $v_1$ is much smaller than when they are projected on $v_2$. For example, $S_{W_1}$ when the data are projected on $v_1$ is much smaller than when the data are projected on $v_2$. Thus, projecting the data on $v_1$ minimizes the within-class variance much better than $v_2$.

From these two notes, we conclude that the first eigenvector meets the goal of the lower-dimensional space of the LDA technique better than the second eigenvector; hence, it is selected to construct the lower-dimensional space.

2.5. Class-dependent vs. class-independent methods

The aim of both LDA methods is to calculate the LDA space. In the class-dependent LDA, one separate lower-dimensional space is calculated for each class as follows, $W_i = S_{W_i}^{-1} S_B$, where $W_i$ represents the transformation matrix of the $i$th class. Thus, eigenvalues and eigenvectors are calculated for each transformation matrix separately, and the samples of each class are projected on their corresponding eigenvectors. On the other hand, in the class-independent method, one lower-dimensional space is calculated for all classes. Thus, the transformation matrix is calculated for all classes, and the samples of all classes are projected on the selected eigenvectors [1].

2.6. LDA algorithm

In this section, the detailed steps of the algorithms of the two LDA methods are presented. As shown in Algorithms 1 and 2, the first four steps in both algorithms are the same. Table 1 shows the notation used in the two algorithms.
Algorithm 1 Linear Discriminant Analysis (LDA): Class-Independent
1: Given a set of $N$ samples $[x_i]_{i=1}^{N}$, each of which is represented as a row of length $M$ as in Fig. 1 (step (A)), the data matrix $X (N \times M)$ is given by

$X = \begin{bmatrix} x_{(1,1)} & x_{(1,2)} & \cdots & x_{(1,M)} \\ x_{(2,1)} & x_{(2,2)} & \cdots & x_{(2,M)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{(N,1)} & x_{(N,2)} & \cdots & x_{(N,M)} \end{bmatrix} \quad (10)$

2: Compute the mean of each class $\mu_i$ ($1 \times M$) as in Equation (2).
3: Compute the total mean of all data $\mu$ ($1 \times M$) as in Equation (3).
4: Calculate the between-class matrix $S_B$ ($M \times M$) as follows:

$S_B = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T \quad (11)$

5: Compute the within-class matrix $S_W$ ($M \times M$) as follows:

$S_W = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (x_{ij} - \mu_j)(x_{ij} - \mu_j)^T \quad (12)$

where $x_{ij}$ represents the $i$th sample in the $j$th class.
6: From Equations (11) and (12), the matrix $W$ that maximizes Fisher's formula defined in Equation (7) is calculated as follows, $W = S_W^{-1} S_B$. The eigenvalues ($\lambda$) and eigenvectors ($V$) of $W$ are then calculated.
7: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first $k$ eigenvectors are then used as the lower dimensional space ($V_k$).
8: Project all original samples ($X$) onto the lower dimensional space of LDA as in Equation (9).

Algorithm 2 Linear Discriminant Analysis (LDA): Class-Dependent
1: Given a set of $N$ samples $[x_i]_{i=1}^{N}$, each of which is represented as a row of length $M$ as in Fig. 1 (step (A)), the data matrix $X (N \times M)$ is given by

$X = \begin{bmatrix} x_{(1,1)} & x_{(1,2)} & \cdots & x_{(1,M)} \\ x_{(2,1)} & x_{(2,2)} & \cdots & x_{(2,M)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{(N,1)} & x_{(N,2)} & \cdots & x_{(N,M)} \end{bmatrix} \quad (13)$

2: Compute the mean of each class $\mu_i$ ($1 \times M$) as in Equation (2).
3: Compute the total mean of all data $\mu$ ($1 \times M$) as in Equation (3).
4: Calculate the between-class matrix $S_B$ ($M \times M$) as in Equation (11).
5: for all Class $i$, $i = 1, 2, \ldots, c$ do
6: Compute the within-class matrix of each class $S_{W_i}$ ($M \times M$) as follows:

$S_{W_j} = \sum_{x_i \in \omega_j} (x_i - \mu_j)(x_i - \mu_j)^T \quad (14)$

7: Construct a transformation matrix for each class ($W_i$) as follows:

$W_i = S_{W_i}^{-1} S_B \quad (15)$

8: The eigenvalues ($\lambda^i$) and eigenvectors ($V^i$) of each transformation matrix ($W_i$) are then calculated, where $\lambda^i$ and $V^i$ represent the calculated eigenvalues and eigenvectors of the $i$th class, respectively.
9: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first $k$ eigenvectors are then used to construct a lower dimensional space for each class ($V_k^i$).
10: Project the samples of each class ($\omega_i$) onto their lower dimensional space ($V_k^i$) as follows:

$y_j = x_i V_k^j, \quad x_i \in \omega_j \quad (16)$

where $y_j$ represents the projected samples of the class $\omega_j$.
11: end for

Table 1
Notation

Notation   Description
X          Data matrix
N          Total number of samples in X
W          Transformation matrix
ni         Number of samples in ωi
μi         The mean of the ith class
μ          Total or global mean of all samples
SWi        Within-class variance or scatter matrix of the ith class (ωi)
SBi        Between-class variance of the ith class (ωi)
V          Eigenvectors of W
Vi         ith eigenvector
xij        The ith sample in the jth class
k          The dimension of the lower dimensional space (Vk)
xi         ith sample
M          Dimension of X or the number of features of X
Vk         The lower dimensional space
c          Total number of classes
mi         The mean of the ith class after projection
m          The total mean of all classes after projection
SW         Within-class variance
SB         Between-class variance
λ          Eigenvalue matrix
λi         ith eigenvalue
Y          Projection of the original data
ωi         ith class
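To complement Algorithm 2, the following sketch (illustrative names; it reuses the `between_class_scatter` helper from the sketch in Section 2.2) computes a separate transformation matrix $W_i = S_{W_i}^{-1} S_B$ for each class and projects that class's samples onto its own sub-space, as in Equations (14)-(16). It assumes every per-class $S_{W_i}$ is non-singular.

```python
import numpy as np

def lda_class_dependent(X, y, k):
    """Class-dependent LDA (Algorithm 2): one projection per class.

    Returns a dict mapping each class label to (V_k_i, projected samples).
    """
    S_B = between_class_scatter(X, y)            # Eq. (11), shared by all classes
    projections = {}
    for c in np.unique(y):
        X_c = X[y == c]
        d = X_c - X_c.mean(axis=0)
        S_Wc = d.T @ d                           # per-class within-class matrix (Eq. (14))
        W_c = np.linalg.inv(S_Wc) @ S_B          # transformation matrix W_i (Eq. (15))
        eigvals, eigvecs = np.linalg.eig(W_c)
        order = np.argsort(eigvals.real)[::-1]
        V_kc = eigvecs[:, order[:k]].real        # k most robust eigenvectors of class c
        projections[c] = (V_kc, X_c @ V_kc)      # project the class samples (Eq. (16))
    return projections
```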
2.7. Computational complexity of LDA

In this section, the computational complexity of LDA is analyzed. The computational complexity of the first four steps, which are common to the class-dependent and class-independent methods, is computed as follows. As illustrated in Algorithm 1, in step (2), to calculate the mean of the $i$th class there are $n_i M$ additions and $M$ divisions, i.e., in total, there are $(NM + cM)$ operations. In step (3), there are $NM$ additions and $M$ divisions, i.e., there are $(NM + M)$ operations. The computational complexity of the fourth step is $c(M + M^2 + M^2)$, where $M$ is for $\mu_i - \mu$, $M^2$ for $(\mu_i - \mu)(\mu_i - \mu)^T$, and the last $M^2$ is for the multiplication between $n_i$ and the matrix $(\mu_i - \mu)(\mu_i - \mu)^T$. In the fifth step, there are $N(M + M^2)$ operations, where $M$ is for $(x_{ij} - \mu_j)$ and $M^2$ is for $(x_{ij} - \mu_j)(x_{ij} - \mu_j)^T$. In the sixth step, there are $M^3$ operations to calculate $S_W^{-1}$, $M^3$ for the multiplication between $S_W^{-1}$ and $S_B$, and $M^3$ to calculate the eigenvalues and eigenvectors. Thus, in the class-independent method, the computational complexity is $O(NM^2)$ if $N > M$; otherwise, the complexity is $O(M^3)$.

In Algorithm 2, the number of operations to calculate the within-class variance of each class $S_{W_j}$ in the sixth step is $n_j(M + M^2)$, and to calculate $S_W$, $N(M + M^2)$ operations are needed. Hence, calculating the within-class variance is the same for both LDA methods. In the seventh and eighth steps, there are $M^3$ operations for the inverse, $M^3$ for the multiplication of $S_{W_i}^{-1} S_B$, and $M^3$ for calculating the eigenvalues and eigenvectors. These two steps are repeated for each class, which increases the complexity of the class-dependent algorithm. In total, the computational complexity of the class-dependent algorithm is $O(NM^2)$ if $N > M$; otherwise, the complexity is $O(cM^3)$. Hence, the class-dependent method needs more computations than the class-independent method.

In our case, we assumed that there are 40 classes and each class has ten samples. Each sample is represented by 4096 features ($M > N$). Thus, the computational complexity of the class-independent method is $O(M^3) = 4096^3$, while the class-dependent method needs $O(cM^3) = 40 \times 4096^3$.

3. Numerical examples

In this section, two numerical examples are presented. The two numerical examples explain the steps to calculate the LDA space and how the LDA technique is used to discriminate between only two different classes. In the first example, the lower-dimensional space is calculated using the class-independent method, while in the second example the class-dependent method is used. Moreover, a comparison between the lower-dimensional spaces of the two methods is presented. In all numerical examples, the numbers are rounded up to the nearest hundredths (i.e. only two digits after the decimal point are displayed).

The first four steps of the class-independent and class-dependent methods are common, as illustrated in Algorithms 1 and 2. Thus, in this section, we first show how these steps are calculated.

Given two different classes, $\omega_1$ ($5 \times 2$) and $\omega_2$ ($6 \times 2$), with $n_1 = 5$ and $n_2 = 6$ samples, respectively. Each sample in both classes is represented by two features (i.e. $M = 2$) as follows:

$\omega_1 = \begin{bmatrix} 1.00 & 2.00 \\ 2.00 & 3.00 \\ 3.00 & 3.00 \\ 4.00 & 5.00 \\ 5.00 & 5.00 \end{bmatrix}$ and $\omega_2 = \begin{bmatrix} 4.00 & 2.00 \\ 5.00 & 0.00 \\ 5.00 & 2.00 \\ 3.00 & 2.00 \\ 5.00 & 3.00 \\ 6.00 & 3.00 \end{bmatrix} \quad (17)$

To calculate the lower-dimensional space using LDA, first the mean of each class $\mu_j$ is calculated. The total mean $\mu$ ($1 \times 2$) is then calculated, which represents the mean of all means of all classes. The values of the mean of each class and the total mean are shown below,

$\mu_1 = [3.00 \;\; 3.60], \quad \mu_2 = [4.67 \;\; 2.00], \quad \text{and} \quad \mu = \frac{5}{11}\mu_1 + \frac{6}{11}\mu_2 = [3.91 \;\; 2.73] \quad (18)$

The between-class variance of each class ($S_{B_i}$ ($2 \times 2$)) and the total between-class variance ($S_B$ ($2 \times 2$)) are then calculated. The value of the between-class variance of the first class ($S_{B_1}$) is:

$S_{B_1} = n_1 (\mu_1 - \mu)^T (\mu_1 - \mu) = 5 \begin{bmatrix} -0.91 \\ 0.87 \end{bmatrix} [-0.91 \;\; 0.87] = \begin{bmatrix} 4.13 & -3.97 \\ -3.97 & 3.81 \end{bmatrix} \quad (19)$

Similarly, $S_{B_2}$ is calculated as follows:

$S_{B_2} = \begin{bmatrix} 3.44 & -3.31 \\ -3.31 & 3.17 \end{bmatrix} \quad (20)$

The total between-class variance is calculated as follows:

$S_B = S_{B_1} + S_{B_2} = \begin{bmatrix} 4.13 & -3.97 \\ -3.97 & 3.81 \end{bmatrix} + \begin{bmatrix} 3.44 & -3.31 \\ -3.31 & 3.17 \end{bmatrix} = \begin{bmatrix} 7.58 & -7.27 \\ -7.27 & 6.98 \end{bmatrix} \quad (21)$

To calculate the within-class matrix, first subtract the mean of each class from each sample in that class; this step is called mean-centering the data and is calculated as follows, $d_i = \omega_i - \mu_i$, where $d_i$ represents the centered data of class $\omega_i$. The values of $d_1$ and $d_2$ are as follows:

$d_1 = \begin{bmatrix} -2.00 & -1.60 \\ -1.00 & -0.60 \\ 0.00 & -0.60 \\ 1.00 & 1.40 \\ 2.00 & 1.40 \end{bmatrix}$ and $d_2 = \begin{bmatrix} -0.67 & 0.00 \\ 0.33 & -2.00 \\ 0.33 & 0.00 \\ -1.67 & 0.00 \\ 0.33 & 1.00 \\ 1.33 & 1.00 \end{bmatrix} \quad (22)$

In the next two subsections, the two different methods are used to calculate the LDA space.
3.1. Class-independent method

In this section, the LDA space is calculated using the class-independent method, which represents the standard LDA method as in Algorithm 1.

After centering the data, the within-class variance of each class ($S_{W_i}$ ($2 \times 2$)) is calculated as follows, $S_{W_j} = d_j^T d_j = \sum_{i=1}^{n_j} (x_{ij} - \mu_j)^T (x_{ij} - \mu_j)$, where $x_{ij}$ represents the $i$th sample in the $j$th class. The total within-class matrix ($S_W$ ($2 \times 2$)) is then calculated as follows, $S_W = \sum_{i=1}^{c} S_{W_i}$. The values of the within-class matrix of each class and the total within-class matrix are as follows:

$S_{W_1} = \begin{bmatrix} 10.00 & 8.00 \\ 8.00 & 7.20 \end{bmatrix}, \quad S_{W_2} = \begin{bmatrix} 5.33 & 1.00 \\ 1.00 & 6.00 \end{bmatrix}, \quad S_W = \begin{bmatrix} 15.33 & 9.00 \\ 9.00 & 13.20 \end{bmatrix} \quad (23)$

The transformation matrix ($W$) in the class-independent method can be obtained as follows, $W = S_W^{-1} S_B$; the values of $S_W^{-1}$ and $W$ are as follows:

$S_W^{-1} = \begin{bmatrix} 0.11 & -0.07 \\ -0.07 & 0.13 \end{bmatrix}$ and $W = \begin{bmatrix} 1.36 & -1.31 \\ -1.48 & 1.42 \end{bmatrix} \quad (24)$

The eigenvalues ($\lambda$ ($2 \times 2$)) and eigenvectors ($V$ ($2 \times 2$)) of $W$ are then calculated as follows:

$\lambda = \begin{bmatrix} 0.00 & 0.00 \\ 0.00 & 2.78 \end{bmatrix}$ and $V = \begin{bmatrix} -0.69 & 0.68 \\ -0.72 & -0.74 \end{bmatrix} \quad (25)$

From the above results it can be noticed that the second eigenvector ($V_2$) has a larger corresponding eigenvalue than the first one ($V_1$), which reflects that the second eigenvector is more robust than the first one; hence, it is selected to construct the lower dimensional space. The original data is then projected on the lower dimensional space as follows, $y_i = \omega_i V_2$, where $y_i$ ($n_i \times 1$) represents the data of the $i$th class after projection, and its values are as follows:

$y_1 = \omega_1 V_2 = \begin{bmatrix} 1.00 & 2.00 \\ 2.00 & 3.00 \\ 3.00 & 3.00 \\ 4.00 & 5.00 \\ 5.00 & 5.00 \end{bmatrix} \begin{bmatrix} 0.68 \\ -0.74 \end{bmatrix} = \begin{bmatrix} -0.79 \\ -0.85 \\ -0.18 \\ -0.97 \\ -0.29 \end{bmatrix} \quad (26)$

Similarly, $y_2$ is as follows:

$y_2 = \omega_2 V_2 = \begin{bmatrix} 1.24 \\ 3.39 \\ 1.92 \\ 0.56 \\ 1.18 \\ 1.86 \end{bmatrix} \quad (27)$

Fig. 4. Probability density function of the projected data of the first example, (a) the projected data on $V_1$, (b) the projected data on $V_2$.

Figure 4 illustrates a probability density function (pdf) graph of the projected data ($y_i$) on the two eigenvectors ($V_1$ and $V_2$). A comparison of the two eigenvectors reveals the following:

• The data of each class are discriminated much better when projected on the second eigenvector (see Fig. 4(b)) than on the first one (see Fig. 4(a)). In other words, the second eigenvector maximizes the between-class variance more than the first one.
• The within-class variance (i.e. the variance between samples of the same class) of the two classes is minimized when the data are projected on the second eigenvector. As shown in Fig. 4(b), the within-class variance of the first class is small compared with Fig. 4(a).
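The numbers in Equations (17)-(27) can be reproduced with a few lines of NumPy. The following standalone sketch (illustrative, not the authors' code) recomputes $S_B$, $S_W$, $W = S_W^{-1} S_B$, its eigenvalues and eigenvectors, and the projection onto the most robust eigenvector; small sign or rounding differences are expected, since eigenvectors are only defined up to scale and sign.

```python
import numpy as np

w1 = np.array([[1., 2.], [2., 3.], [3., 3.], [4., 5.], [5., 5.]])            # class 1 (Eq. (17))
w2 = np.array([[4., 2.], [5., 0.], [5., 2.], [3., 2.], [5., 3.], [6., 3.]])  # class 2

mu1, mu2 = w1.mean(axis=0), w2.mean(axis=0)
mu = (len(w1) * mu1 + len(w2) * mu2) / (len(w1) + len(w2))                   # total mean (Eq. (18))

SB = (len(w1) * np.outer(mu1 - mu, mu1 - mu)
      + len(w2) * np.outer(mu2 - mu, mu2 - mu))                              # Eqs. (19)-(21)
SW = (w1 - mu1).T @ (w1 - mu1) + (w2 - mu2).T @ (w2 - mu2)                   # Eqs. (22)-(23)

W = np.linalg.inv(SW) @ SB                                                   # Eq. (24)
eigvals, eigvecs = np.linalg.eig(W)                                          # Eq. (25)
v = eigvecs[:, np.argmax(eigvals.real)].real                                 # most robust eigenvector
y1, y2 = w1 @ v, w2 @ v                                                      # projections (Eqs. (26)-(27))
print(np.round(SB, 2), np.round(SW, 2), np.round(eigvals.real, 2), sep="\n")
```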
3.2. Class-dependent method

In this section, the LDA space is calculated using the class-dependent method. As mentioned in Section 2.5, the class-dependent method aims to calculate a separate transformation matrix ($W_i$) for each class.

The within-class variance of each class ($S_{W_i}$ ($2 \times 2$)) is calculated as in the class-independent method. The transformation matrix ($W_i$) of each class is then calculated as follows, $W_i = S_{W_i}^{-1} S_B$. The values of the two transformation matrices ($W_1$ and $W_2$) are as follows:

$W_1 = S_{W_1}^{-1} S_B = \begin{bmatrix} 10.00 & 8.00 \\ 8.00 & 7.20 \end{bmatrix}^{-1} \begin{bmatrix} 7.58 & -7.27 \\ -7.27 & 6.98 \end{bmatrix} = \begin{bmatrix} 0.90 & -1.00 \\ -1.00 & 1.25 \end{bmatrix} \begin{bmatrix} 7.58 & -7.27 \\ -7.27 & 6.98 \end{bmatrix} = \begin{bmatrix} 14.09 & -13.53 \\ -16.67 & 16.00 \end{bmatrix} \quad (28)$

Similarly, $W_2$ is calculated as follows:

$W_2 = \begin{bmatrix} 1.70 & -1.63 \\ -1.50 & 1.44 \end{bmatrix} \quad (29)$

The eigenvalues ($\lambda^i$) and eigenvectors ($V^i$) of each transformation matrix ($W_i$) are calculated, and their values are shown below:

$\lambda_{\omega_1} = \begin{bmatrix} 0.00 & 0.00 \\ 0.00 & 30.01 \end{bmatrix}$ and $V_{\omega_1} = \begin{bmatrix} -0.69 & 0.65 \\ -0.72 & -0.76 \end{bmatrix} \quad (30)$

$\lambda_{\omega_2} = \begin{bmatrix} 3.14 & 0.00 \\ 0.00 & 0.00 \end{bmatrix}$ and $V_{\omega_2} = \begin{bmatrix} 0.75 & 0.69 \\ -0.66 & 0.72 \end{bmatrix} \quad (31)$

where $\lambda_{\omega_i}$ and $V_{\omega_i}$ represent the eigenvalues and eigenvectors of the $i$th class, respectively.

From the results shown above it can be seen that the second eigenvector of the first class ($V_{\omega_1}^{\{2\}}$) has a larger corresponding eigenvalue than the first one; thus, the second eigenvector is used as the lower dimensional space for the first class as follows, $y_1 = \omega_1 V_{\omega_1}^{\{2\}}$, where $y_1$ represents the projection of the samples of the first class. In contrast, the first eigenvector of the second class ($V_{\omega_2}^{\{1\}}$) has a larger corresponding eigenvalue than the second one. Thus, $V_{\omega_2}^{\{1\}}$ is used to project the data of the second class as follows, $y_2 = \omega_2 V_{\omega_2}^{\{1\}}$, where $y_2$ represents the projection of the samples of the second class. The values of $y_1$ and $y_2$ are as follows:

$y_1 = \begin{bmatrix} -0.88 \\ -1.00 \\ -0.35 \\ -1.24 \\ -0.59 \end{bmatrix}$ and $y_2 = \begin{bmatrix} 1.68 \\ 3.76 \\ 2.43 \\ 0.93 \\ 1.77 \\ 2.53 \end{bmatrix} \quad (32)$

Fig. 5. Probability density function (pdf) of the projected data using the class-dependent method; the first class is projected on $V_{\omega_1}^{\{2\}}$, while the second class is projected on $V_{\omega_2}^{\{1\}}$.

Figure 5 shows a pdf graph of the projected data (i.e. $y_1$ and $y_2$) on the two eigenvectors ($V_{\omega_1}^{\{2\}}$ and $V_{\omega_2}^{\{1\}}$), and it reveals the following findings:

• First, the projected data of the two classes are efficiently discriminated.
• Second, the within-class variance of the projected samples is lower than the within-class variance of the original samples.
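For completeness, the class-dependent numbers in Equations (28)-(32) can be checked with a similar standalone sketch (again only up to sign and rounding, since eigenvector signs are arbitrary); the data and names are the same illustrative ones used for the Section 3.1 sketch.

```python
import numpy as np

w1 = np.array([[1., 2.], [2., 3.], [3., 3.], [4., 5.], [5., 5.]])
w2 = np.array([[4., 2.], [5., 0.], [5., 2.], [3., 2.], [5., 3.], [6., 3.]])
mu1, mu2 = w1.mean(axis=0), w2.mean(axis=0)
mu = (5 * mu1 + 6 * mu2) / 11
SB = 5 * np.outer(mu1 - mu, mu1 - mu) + 6 * np.outer(mu2 - mu, mu2 - mu)

for w in (w1, w2):
    SWi = (w - w.mean(axis=0)).T @ (w - w.mean(axis=0))   # per-class S_Wi
    Wi = np.linalg.inv(SWi) @ SB                          # Eqs. (28)-(29)
    vals, vecs = np.linalg.eig(Wi)                        # Eqs. (30)-(31)
    v = vecs[:, np.argmax(vals.real)].real                # eigenvector with the largest eigenvalue
    print(np.round(w @ v, 2))                             # projected samples of this class (Eq. (32))
```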
3.3. Discussion

In these two numerical examples, the LDA space is calculated using the class-dependent and class-independent methods. Figure 6 gives a further explanation of the two methods as follows:

• Class-independent: As shown in the figure, there are two eigenvectors, $V_1$ (dotted black line) and $V_2$ (solid black line). The differences between the two eigenvectors are as follows:
  ∗ The projected data on the second eigenvector ($V_2$), which has the highest corresponding eigenvalue, discriminate the data of the two classes better than the first eigenvector. As shown in the figure, the distance between the projected means $m_1 - m_2$, which represents $S_B$, increases when the data are projected on $V_2$ rather than on $V_1$.
  ∗ The second eigenvector decreases the within-class variance much better than the first eigenvector. Figure 6 illustrates that the within-class variance of the first class ($S_{W_1}$) is much smaller when the data are projected on $V_2$ than on $V_1$.
  ∗ As a result of the above two findings, $V_2$ is used to construct the LDA space.

• Class-dependent: As shown in the figure, there are two eigenvectors, $V_{\omega_1}^{\{2\}}$ (red line) and $V_{\omega_2}^{\{1\}}$ (blue line), which represent the first and second classes, respectively. The differences between the two eigenvectors are as follows:
  ∗ Projecting the original data on the two eigenvectors discriminates between the two classes. As shown in the figure, the distance between the projected means $m_1 - m_2$ is larger than the distance between the original means $\mu_1 - \mu_2$.
  ∗ The within-class variance of each class is decreased. For example, the within-class variance of the first class ($S_{W_1}$) is decreased when it is projected on its corresponding eigenvector.
  ∗ As a result of the above two findings, $V_{\omega_1}^{\{2\}}$ and $V_{\omega_2}^{\{1\}}$ are used to construct the LDA space.

• Class-dependent vs. class-independent: The two LDA methods are used to calculate the LDA space, but the class-dependent method calculates a separate lower-dimensional space for each class, which has two main limitations: (1) it needs more CPU time and calculations than the class-independent method; (2) it may lead to the SSS problem, because the number of samples in each class affects the singularity of $S_{W_i}$.²

These findings explain why the standard LDA technique uses the class-independent method rather than the class-dependent method.

² The SSS problem will be explained in Section 4.2.
Fig. 6. Illustration of the example of the two different LDA methods. The blue and red lines represent the first and second eigenvectors of the class-dependent approach, respectively, while the solid and dotted black lines represent the second and first eigenvectors of the class-independent approach, respectively.

4. Main problems of LDA

Although LDA is one of the most common data reduction techniques, it suffers from two main problems: the Small Sample Size (SSS) problem and the linearity problem. In the next two subsections, these two problems are explained, and some of the state-of-the-art solutions are highlighted.

4.1. Linearity problem

The LDA technique is used to find a linear transformation that discriminates between different classes. However, if the classes are non-linearly separable, LDA cannot find a lower dimensional space. In other words, LDA fails to find the LDA space when the discriminatory information is not in the means of the classes. Figure 7 shows how the discriminatory information does not exist in the mean, but in the variance of the data; this is because the means of the two classes are equal. The mathematical interpretation of this problem is as follows: if the means of the classes are approximately equal, then $S_B$ and $W$ will be zero, and hence the LDA space cannot be calculated.

One of the solutions to this problem is based on the transformation concept, which is known as kernel methods or functions [3,50]. Figure 7 illustrates how the transformation is used to map the original data into a higher dimensional space; hence, the data become linearly separable, and the LDA technique can find the lower dimensional space in the new space. Figure 8 graphically and mathematically shows how two non-separable classes in a one-dimensional space are transformed into a two-dimensional space (i.e. a higher dimensional space), thus allowing linear separation. The kernel idea is applied in Support Vector Machines (SVM) [49,66,69], Support Vector Regression (SVR) [58], PCA [51], and LDA [50].
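The transformation idea of Fig. 8 can be illustrated with a tiny synthetic example: two one-dimensional classes with equal means are not linearly separable, but mapping each sample $x$ to $(x, x^2)$ (one possible mapping, chosen here purely for illustration) moves the discriminatory information into a dimension in which a linear boundary exists.

```python
import numpy as np

# Two 1-D classes with equal means: class 0 near the origin, class 1 far from it.
x0 = np.array([-0.5, -0.3, 0.0, 0.3, 0.5])        # inner class
x1 = np.array([-2.0, -1.8, 1.8, 2.0, 2.2, -2.2])  # outer class
print(x0.mean(), x1.mean())                       # both means are 0, so S_B is ~0 in 1-D

# Explicit mapping phi(x) = (x, x^2): in the new space the second coordinate
# separates the two classes linearly (small x^2 vs. large x^2).
z0 = np.column_stack([x0, x0 ** 2])
z1 = np.column_stack([x1, x1 ** 2])
print(z0[:, 1].max(), z1[:, 1].min())             # any threshold between these values separates the classes
```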
Fig. 7. Two examples of two non-linearly separable classes; the top panel shows how the two classes are non-separable, while the bottom panel shows how the transformation solves this problem and the two classes become linearly separable.

Fig. 8. Example of kernel functions: the samples in the top panel ($X$), which lie on a line (i.e. a one-dimensional space), are non-linearly separable, while the samples in the bottom panel ($Z$), which are generated by mapping the samples of the top space, are linearly separable.

Let $\phi$ represent a nonlinear mapping to the new feature space $Z$. The transformation matrix ($W$) in the new feature space ($Z$) is calculated as in Equation (33).

$F(W) = \max \frac{|W^T S_B^{\phi} W|}{|W^T S_W^{\phi} W|} \quad (33)$

where $W$ is a transformation matrix and $Z$ is the new feature space. The between-class matrix ($S_B^{\phi}$) and the within-class matrix ($S_W^{\phi}$) are defined as follows:

$S_B^{\phi} = \sum_{i=1}^{c} n_i (\mu_i^{\phi} - \mu^{\phi})(\mu_i^{\phi} - \mu^{\phi})^T \quad (34)$

$S_W^{\phi} = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (\phi\{x_{ij}\} - \mu_j^{\phi})(\phi\{x_{ij}\} - \mu_j^{\phi})^T \quad (35)$

where $\mu_i^{\phi} = \frac{1}{n_i} \sum_{x_j \in \omega_i} \phi\{x_j\}$ and $\mu^{\phi} = \frac{1}{N} \sum_{i=1}^{N} \phi\{x_i\} = \sum_{i=1}^{c} \frac{n_i}{N} \mu_i^{\phi}$.

Thus, in kernel LDA, all samples are transformed non-linearly into a new space $Z$ using the function $\phi$. In other words, the $\phi$ function is used to map the original features into the $Z$ space by creating a non-linear combination of the original samples using dot-products [3]. There are many types of kernel functions that achieve this aim. Examples of these functions include the Gaussian or Radial Basis Function (RBF) kernel, $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, where $\sigma$ is a positive parameter, and the polynomial kernel of degree $d$, $K(x_i, x_j) = (\langle x_i, x_j \rangle + c)^d$ [3,29,51,72].
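As a small illustration of the kernel functions just listed (not of a full kernel-LDA implementation), the following sketch computes the RBF and polynomial kernel matrices $K(x_i, x_j)$ for a data matrix; the parameter values are arbitrary examples.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def polynomial_kernel(X, c=1.0, d=2):
    """K(x_i, x_j) = (<x_i, x_j> + c)^d."""
    return (X @ X.T + c) ** d

X = np.random.RandomState(0).randn(6, 3)   # toy data: 6 samples, 3 features
print(rbf_kernel(X, sigma=2.0).shape, polynomial_kernel(X).shape)  # both (6, 6)
```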
4.2. Small sample size problem

4.2.1. Problem definition
The singularity, Small Sample Size (SSS), or under-sampled problem is one of the big problems of the LDA technique. This problem results from high-dimensional pattern classification tasks or from a low number of training samples available for each class compared with the dimensionality of the sample space [30,38,82,85].

The SSS problem occurs when $S_W$ is singular.³ The upper bound of the rank⁴ of $S_W$ is $N - c$, while the dimension of $S_W$ is $M \times M$ [17,38]. Thus, in most cases $M \gg N - c$, which leads to the SSS problem. For example, in face recognition applications, the size of the face image may reach $100 \times 100 = 10{,}000$ pixels, which represents high-dimensional features and leads to a singularity problem.

³ A matrix is singular if it is square, does not have a matrix inverse, and its determinant is zero; hence, not all of its columns and rows are independent.
⁴ The rank of a matrix represents the number of linearly independent rows or columns.

4.2.2. Common solutions to the SSS problem
Many studies have proposed solutions for this problem; each has its advantages and drawbacks.

• Regularization (RLDA): In the regularization method, the identity matrix is scaled by multiplying it by a regularization parameter ($\eta > 0$) and adding it to the within-class matrix to make it non-singular [18,38,45,82]. Thus, the diagonal components of the within-class matrix are biased as follows, $S_W = S_W + \eta I$. However, choosing the value of the regularization parameter requires tuning, and a poor choice for this parameter can degrade the performance of the method [38,45]. Another problem of this method is that the parameter $\eta$ is just added to make the inversion of $S_W$ possible and has no clear mathematical interpretation [38,57].
• Sub-space: In this method, a non-singular intermediate space is obtained to reduce the dimension of the original data to be equal to the rank of $S_W$; hence, $S_W$ becomes full-rank⁵ and can be inverted. For example, Belhumeur et al. [4] used PCA to reduce the dimensions of the original space to be equal to $N - c$ (i.e. the upper bound of the rank of $S_W$). However, as reported in [22], losing some discriminant information is a common drawback associated with the use of this method.
• Null space: Many studies have proposed to remove the null space of $S_W$ to make $S_W$ full-rank and hence invertible. The drawback of this method is that more discriminant information is lost when the null space of $S_W$ is removed, which has a negative impact on how well the lower dimensional space satisfies the LDA goal [83].

⁵ $A$ is a full-rank matrix if all columns and rows of the matrix are independent (i.e. rank($A$) = #rows = #cols) [23].

Four different variants of the LDA technique that are used to solve the SSS problem are introduced as follows.

PCA+LDA technique. In this technique, the original $d$-dimensional features are first reduced to an $h$-dimensional feature space using PCA, and then LDA is used to further reduce the features to $k$ dimensions. PCA is used in this technique to reduce the dimensions so that the rank of $S_W$ becomes $N - c$, as reported in [4]; hence, the SSS problem is addressed. However, PCA neglects some discriminant information, which may reduce the classification performance [57,60].

Direct LDA technique. Direct LDA (DLDA) is one of the well-known techniques used to solve the SSS problem. This technique has two main steps [83]. In the first step, the transformation matrix, $W$, is computed to transform the training data to the range space of $S_B$. In the second step, the dimensionality of the transformed data is further reduced using some regulating matrices as in Algorithm 4. The benefit of DLDA is that no discriminative features are neglected, as happens in the PCA+LDA technique [83].

Regularized LDA technique. In Regularized LDA (RLDA), a small perturbation is added to the $S_W$ matrix to make it non-singular, as mentioned in [18]. This regularization can be applied as follows:

$(S_W + \eta I)^{-1} S_B w_i = \lambda_i w_i \quad (36)$

where $\eta$ represents a regularization parameter. The diagonal components of $S_W$ are biased by adding this small perturbation [13,18]. However, the regularization parameter needs to be tuned, and a poor choice of it can degrade the generalization performance [57].
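A minimal sketch of the regularization step in Equation (36) follows, assuming the scatter-matrix sketches from Section 2 and an illustrative default value of $\eta$ (in practice $\eta$ must be tuned, e.g. by cross-validation, as noted above).

```python
import numpy as np

def regularized_lda(X, y, k, eta=1e-3):
    """RLDA sketch: solve (S_W + eta I)^{-1} S_B w = lambda w (Eq. (36))."""
    S_B = between_class_scatter(X, y)
    S_W = within_class_scatter(X, y)
    M = X.shape[1]
    S_W_reg = S_W + eta * np.eye(M)              # bias the diagonal so S_W becomes invertible
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W_reg) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k]].real            # projection matrix V_k
```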
Null LDA technique. The aim of the NLDA technique is to find the orientation matrix $W$, and this can be achieved in two steps. In the first step, the range space of $S_W$ is neglected, and the data are projected only on the null space of $S_W$, i.e. $S_W W = 0$. In the second step, the aim is to search for a $W$ that satisfies $S_B W \neq 0$ and maximizes $|W^T S_B W|$. The high dimensionality of the feature space may lead to computational problems. This can be solved by (1) using the PCA technique as a pre-processing step, i.e. before applying the NLDA technique, to reduce the dimension of the feature space to $N - 1$ by removing the null space of $S_T = S_B + S_W$ [57], or (2) using the PCA technique before the second step of the NLDA technique [54]. Mathematically, in the Null LDA (NLDA) technique, the $h$ column vectors of the transformation matrix $W = [w_1, w_2, \ldots, w_h]$ are taken to be the null space of $S_W$, i.e. $w_i^T S_W w_i = 0, \forall i = 1, \ldots, h$, where $w_i^T S_B w_i \neq 0$. Hence, $M - (N - c)$ linearly independent vectors are used to form a new orientation matrix, which is used to maximize $|W^T S_B W|$ subject to the constraint $|W^T S_W W| = 0$, as in Equation (37).

$W = \arg\max_{|W^T S_W W| = 0} |W^T S_B W| \quad (37)$

5. Applications of the LDA technique

In many applications, due to the high number of features or dimensions, the LDA technique has been used. Some of the applications of the LDA technique and its variants are described as follows.

5.1. Biometrics applications

Biometric systems have two main steps, namely, feature extraction (including pre-processing steps) and recognition. In the first step, the features are extracted from the collected data, e.g. face images, and in the second step, the unknown sample, e.g. an unknown face image, is identified/verified. The LDA technique and its variants have been applied in this application. For example, in [10,20,41,68,75,83], the LDA technique has been applied to face recognition. Moreover, the LDA technique was used in ear [84], fingerprint [44], gait [5], and speech [24] applications. In addition, the LDA technique was used with animal biometrics as in [19,65].

5.2. Agriculture applications

In agriculture applications, an unknown sample can be classified into a pre-defined species using computational models [64]. In this application, different variants of the LDA technique were used to reduce the dimension of the collected features, as in [9,21,26,46,63,64].

5.3. Medical applications

In medical applications, data such as DNA microarray data consist of a large number of features or dimensions. Due to this high dimensionality, the computational models need more time to train, which may be infeasible and expensive. Moreover, this high dimensionality reduces the classification performance of the computational model and increases its complexity. This problem can be solved using the LDA technique to construct a new set of features from a large number of original features. Many papers have used LDA in medical applications [8,16,39,52–55].

6. Packages

In this section, some of the available packages that are used to compute the space of LDA variants are listed. For example, WEKA⁶ is a well-known Java-based data mining tool with open source machine learning software for classification, association rules, regression, pre-processing, clustering, and visualization. In WEKA, the machine learning algorithms can be applied directly to the dataset or called from one's own Java code. XLSTAT⁷ is another data analysis and statistical package for Microsoft Excel that has a wide variety of dimensionality reduction algorithms including LDA. The dChip⁸ package is used for visualization of gene expression and SNP microarrays, and includes some data analysis algorithms such as LDA, clustering, and PCA. LDA-SSS⁹ is a Matlab package that contains several algorithms related to the LDA technique and its variants such as DLDA, PCA+LDA, and NLDA. The MASS¹⁰ package is based on R, and it has functions to perform linear and quadratic discriminant function analysis. The Dimensionality Reduction¹¹ package is mainly written in Matlab, and it has a number of dimensionality reduction techniques such as ULDA, QLDA, and KDA. DTREG¹² is a software package that is used for medical data and business modeling, and it has several predictive modeling methods such as LDA, PCA, linear regression, and decision trees.

6 http://www.cs.waikato.ac.nz/ml/weka/
7 http://www.xlstat.com/en/
8 https://sites.google.com/site/dchipsoft/home
9 http://www.staff.usp.ac.fj/sharma_al/index.htm
10 http://www.statmethods.net/advstats/discriminant.html
11 http://www.public.asu.edu/~jye02/Software/index.html
12 http://www.dtreg.com/index.htm

16 A. Tharwat et al. / Linear discriminant analysis: A detailed tutorial

7. Experimental results and discussion

In this section, two experiments were conducted to illustrate: (1) how the LDA is used for different applications, (2) what the relation is between its parameter (the eigenvectors) and the accuracy of a classification problem, and (3) when the SSS problem could appear and a method for solving it.

7.1. Experimental setup

This section gives an overview of the databases, the platform, and the machine specification used to conduct our experiments. Different biometric datasets were used in the experiments to show how the LDA, using its parameter, behaves with different data. These datasets are described as follows:

• ORL dataset13 (Olivetti Research Laboratory, Cambridge) [48]: a face image dataset consisting of 40 distinct individuals was used. In this dataset, each individual has ten images taken at different times and under varying light conditions. The size of each image is 92 × 112.
• Yale dataset14: another face image dataset, which contains 165 grey-scale images in GIF format of 15 individuals [78]. Each individual has 11 images in different expressions and configurations: center-light, happy, left-light, with glasses, normal, right-light, sad, sleepy, surprised, and a wink.
• 2D ear dataset (Carreira-Perpinan, 1995)15: an ear image dataset [7] was used. The ear dataset consists of 17 distinct individuals. Six views of the left profile from each subject were taken under uniform, diffuse lighting.

13 http://www.cam-orl.co.uk
14 http://vision.ucsd.edu/content/yale-face-database
15 http://faculty.ucmerced.edu/mcarreira-perpinan/software.html

In all experiments, k-fold cross-validation tests were used. In k-fold cross-validation, the original samples of the dataset are randomly partitioned into k subsets of (approximately) equal size and the experiment is run k times. Each time, one subset is used as the testing set and the other k − 1 subsets are used as the training set. The average of the k results from the folds can then be calculated to produce a single estimate. In this study, the value of k was set to 10.
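As a rough illustration of this protocol (our own sketch, not code from the paper; the fold count, the array name X, and the label array y are assumptions), a 10-fold partition can be generated as follows:

    import numpy as np

    def k_fold_indices(n_samples, k=10, seed=0):
        # Shuffle the sample indices and split them into k subsets of
        # (approximately) equal size.
        rng = np.random.RandomState(seed)
        return np.array_split(rng.permutation(n_samples), k)

    folds = k_fold_indices(n_samples=400, k=10)
    for i, test_idx in enumerate(folds):
        # One subset is used for testing, the remaining k - 1 for training;
        # the k accuracies are then averaged to produce a single estimate.
        train_idx = np.hstack([f for j, f in enumerate(folds) if j != i])
        # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]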
The images in all datasets were resized to 64 × 64 and 32 × 32, as shown in Table 2. Figure 9 shows samples of the used datasets, and Table 2 gives a description of the datasets used in our experiments.

Table 2
Dataset description

Dataset      Dimension (M)   No. of samples (N)   No. of classes (c)
ORL64×64     4096            400                  40
ORL32×32     1024            400                  40
Ear64×64     4096            102                  17
Ear32×32     1024            102                  17
Yale64×64    4096            165                  15
Yale32×32    1024            165                  15

In all experiments, to show the effect of the LDA with its eigenvector parameter and its SSS problem on the classification accuracy, the Nearest Neighbour classifier was used. This classifier aims to classify a testing image by comparing its position in the LDA space with the positions of the training images. Furthermore, class-independent LDA was used in all experiments. Moreover, Matlab (R2013b) was used in our experiments, running on a PC with the following specifications: Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz, 4.00 GB RAM, and a 32-bit Windows operating system.
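This evaluation step can be sketched as follows (a minimal example of our own, not the authors' code; W denotes an already computed M × k LDA projection matrix, and the data matrices are assumed to hold one flattened image per row):

    import numpy as np

    def nearest_neighbour_accuracy(W, X_train, y_train, X_test, y_test):
        # Project the training and testing images onto the LDA space.
        P_train = X_train @ W
        P_test = X_test @ W
        correct = 0
        for p, label in zip(P_test, y_test):
            # Assign the label of the closest training sample (1-NN).
            distances = np.linalg.norm(P_train - p, axis=1)
            correct += int(y_train[np.argmin(distances)] == label)
        return correct / len(y_test)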
7.2. Experiment on LDA parameter (eigenvectors)

The aim of this experiment is to investigate the relation between the number of eigenvectors used in the LDA space and both the classification accuracy based on these eigenvectors and the CPU time required for this classification.

As explained earlier, the LDA space consists of k eigenvectors, which are sorted according to their robustness (i.e. their eigenvalues). The robustness of each eigenvector reflects its ability to discriminate between different classes. Thus, in this experiment, it will be checked whether increasing the number of eigenvectors would increase the total robustness of the constructed LDA space, so that different classes could be well discriminated. It will also be tested whether increasing the number of eigenvectors would increase the dimension of the LDA space and of the projected data, and hence the CPU time. To investigate these issues, the three datasets listed in Table 2 (i.e. ORL32×32, Ear32×32, and Yale32×32) were used. Moreover, seven, four, and eight images from each subject in the ORL, ear, and Yale datasets, respectively, were used in this experiment. The results of this experiment are presented in Fig. 10.

From Fig. 10 it can be noticed that the accuracy and the CPU time are proportional to the number of eigenvectors which are used to construct the LDA space. Thus, the choice of using LDA in a specific application should consider a trade-off between these factors.
Fig. 9. Samples of the first individual in: ORL face dataset (top row); Ear dataset (middle row); and Yale face dataset (bottom row).

Fig. 10. Accuracy and CPU time of the LDA technique using different percentages of eigenvectors: (a) Accuracy, (b) CPU time.

Moreover, from Fig. 10(a) it can be remarked that when the number of eigenvectors used in computing the LDA space was increased, the classification accuracy also increased, up to a specific extent after which the accuracy remained constant. As seen in Fig. 10(a), this extent differs from one application to another. For example, the accuracy of the ear dataset remains constant when the percentage of the used eigenvectors is more than 10%. This was expected, as the eigenvectors of the LDA space are sorted according to their robustness (see Section 2.6). Similarly, in the ORL and Yale datasets the accuracy becomes approximately constant when the percentage of the used eigenvectors is more than 40%. In terms of CPU time, Fig. 10(b) shows the CPU time using different percentages of the eigenvectors. As shown, the CPU time increases dramatically when the number of eigenvectors increases.

Figure 11 shows the weights16 of the first 40 eigenvectors, which confirms our findings. From these results, we can conclude that the high-order eigenvectors of the data of each application (the first 10% for the ear dataset and the first 40% for the ORL and Yale datasets) are robust enough to extract and save the most discriminative features, which are used to achieve a good accuracy.

These experiments confirmed that increasing the number of eigenvectors increases the dimension of the LDA space and, hence, the CPU time; consequently, the amount of discriminative information and the accuracy also increase.

16 The weight of an eigenvector represents the ratio of its corresponding eigenvalue λi to the total of all eigenvalues λj, j = 1, 2, ..., k, i.e. λi / (λ1 + λ2 + · · · + λk).
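A minimal sketch of this weighting and selection step (our own illustration with assumed variable and function names, not code from the paper) sorts the eigenvectors by eigenvalue, computes their weights, and keeps only the leading percentage:

    import numpy as np

    def select_eigenvectors(eigvals, eigvecs, percentage=0.4):
        # Sort the eigenvectors by decreasing eigenvalue (robustness).
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        # Weight of each eigenvector: its eigenvalue divided by the sum of all.
        weights = eigvals / eigvals.sum()
        # Keep only the leading fraction of the sorted eigenvectors.
        k = max(1, int(np.ceil(percentage * eigvals.size)))
        return eigvecs[:, :k], weights[:k]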
Fig. 11. The robustness of the first 40 eigenvectors of the LDA technique using the ORL32×32, Ear32×32, and Yale32×32 datasets.

7.3. Experiments on the small sample size problem

The aim of this experiment is to show when the LDA is subject to the SSS problem and which methods could be used to solve it. In this experiment, the PCA-LDA [4] and Direct-LDA [22,83] methods are used to address the SSS problem. A set of experiments was conducted to show how the two methods interact with the SSS problem, including the number of samples in each class, the total number of classes used, and the dimension of each sample.

As explained in Section 4, using LDA directly may lead to the SSS problem when the dimension of the samples is much higher than the total number of samples. As shown in Table 2, the size of each image (sample) of the ORL dataset is 64 × 64, i.e. 4096, while the total number of samples is 400. The mathematical interpretation of this point shows that the dimension of SW is M × M, while the upper bound of the rank of SW is N − c [79,82]. Thus, all the datasets reported in Table 2 lead to a singular SW.
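This singularity can be verified numerically; the sketch below is our own check (assuming the within-class scatter matrix is available as a numpy array S_W), not part of the original experiments:

    import numpy as np

    def has_sss_problem(S_W):
        # Compare the numerical rank of S_W with its dimension M.
        M = S_W.shape[0]
        rank = np.linalg.matrix_rank(S_W)
        # With N training samples and c classes, rank(S_W) <= N - c,
        # so rank < M means S_W is singular (the SSS problem).
        return rank < M

    # Example for ORL32x32: M = 1024, N = 400, c = 40, so rank(S_W) <= 360 < 1024.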
To address this problem, the PCA-LDA and Direct-LDA methods were implemented. In the PCA-LDA method, PCA is first used to reduce the dimension of the original data so that SW becomes full-rank, and then standard LDA can be performed safely in the resulting subspace, as in Algorithm 3; more details of the PCA-LDA method are reported in [4]. In the Direct-LDA method, the null space of the SW matrix is removed to make SW full-rank, and then the standard LDA space can be calculated as in Algorithm 4; more details of the Direct-LDA method are found in [83].

Algorithm 3 PCA-LDA
1: Read the training images (X = {x1, x2, ..., xN}), where xi (ro × co) represents the ith training image, ro and co represent the rows (height) and columns (width) of xi, respectively, and N represents the total number of training images.
2: Convert all images into vector representation Z = {z1, z2, ..., zN}, where the dimension of each zi is M × 1, M = ro × co.
3: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB (M × M), and the within-class matrix SW (M × M) as in Algorithm 1 (Steps 2–5).
4: Use the PCA technique to reduce the dimension of X to be equal to or lower than r, where r represents the rank of SW, as follows:

    XPCA = U^T X    (38)

where U ∈ R^(M×r) is the lower-dimensional space of the PCA and XPCA represents the projected data on the PCA space.
5: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB, and the within-class matrix SW of XPCA as in Algorithm 1 (Steps 2–5).
6: Calculate W as in Equation (7) and then calculate the eigenvalues (λ) and eigenvectors (V) of the W matrix.
7: Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first k eigenvectors are then used as a lower-dimensional space (Vk).
8: The original samples (X) are first projected on the PCA space as in Equation (38). The projection on the LDA space is then calculated as follows:

    XLDA = Vk^T XPCA = Vk^T U^T X    (39)

where XLDA represents the final projection on the LDA space.
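The following compact numpy sketch follows the spirit of Algorithm 3; it is our own simplified rendering (the scatter matrices are computed inline, the PCA basis is taken from an SVD of the centred data, and a pseudo-inverse is used for robustness), not the authors' reference implementation:

    import numpy as np

    def scatter_matrices(X, y):
        # X: (N, M) data matrix with one sample per row; y: class labels.
        mu = X.mean(axis=0)
        M = X.shape[1]
        S_B = np.zeros((M, M))
        S_W = np.zeros((M, M))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            d = (mu_c - mu).reshape(-1, 1)
            S_B += len(Xc) * (d @ d.T)           # between-class scatter
            S_W += (Xc - mu_c).T @ (Xc - mu_c)   # within-class scatter
        return S_B, S_W

    def pca_lda(X, y, k):
        # PCA step: keep at most r = rank(S_W) components so that the
        # within-class scatter of the projected data is non-singular.
        _, S_W = scatter_matrices(X, y)
        r = np.linalg.matrix_rank(S_W)
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        U = Vt[:r].T                              # (M, r) PCA basis
        X_pca = Xc @ U
        # Standard LDA in the PCA subspace.
        S_B, S_W = scatter_matrices(X_pca, y)
        evals, evecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
        order = np.argsort(evals.real)[::-1]
        V_k = evecs[:, order[:k]].real            # (r, k) LDA basis
        return X_pca @ V_k, U @ V_k               # projected data, overall transform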
Table 2 illustrates the various scenarios designed to test the effect of different dimensions on the rank of SW and on the accuracy. The results of these scenarios using both the PCA-LDA and Direct-LDA methods are summarized in Table 3.

As summarized in Table 3, the rank of SW is very small compared to the whole dimension of SW; hence, the SSS problem occurs in all cases. As shown in Table 3, using the PCA-LDA and the Direct-LDA methods, the SSS problem can be solved, and the Direct-LDA
method achieved better results than PCA-LDA because, as reported in [22,83], the Direct-LDA method saves approximately all the information that is important for classification, while the PCA step in the PCA-LDA method saves the information with the highest variance.

Table 3
Accuracy of the PCA-LDA and Direct-LDA methods using the datasets listed in Table 2

Dataset      Dim(SW)        # Training images   # Testing images   Rank(SW)   Accuracy (%)
                                                                              PCA-LDA   Direct-LDA
ORL32×32     1024 × 1024    5                   5                  160        75.5      88.5
                            7                   3                  240        75.5      97.5
                            9                   1                  340        80.39     97.5
Ear32×32     1024 × 1024    3                   3                  34         80.39     96.08
                            4                   2                  51         88.24     94.12
                            5                   1                  68         100       100
Yale32×32    1024 × 1024    6                   5                  75         78.67     90.67
                            8                   3                  105        84.44     97.79
                            10                  1                  135        100       100
ORL64×64     4096 × 4096    5                   5                  160        72        87.5
                            7                   3                  240        81.67     96.67
                            9                   1                  340        82.5      97.5
Ear64×64     4096 × 4096    3                   3                  34         74.5      96.08
                            4                   2                  51         91.18     96.08
                            5                   1                  68         100       100
Yale64×64    4096 × 4096    6                   5                  75         74.67     92
                            8                   3                  105        95.56     97.78
                            10                  1                  135        93.33     100

The bold values indicate that the corresponding methods obtain the best performance.
Algorithm 4 Direct-LDA
1: Read the training images (X = {x1, x2, ..., xN}), where xi (ro × co) represents the ith training image, ro and co represent the rows (height) and columns (width) of xi, respectively, and N represents the total number of training images.
2: Convert all images into vector representation Z = {z1, z2, ..., zN}, where the dimension of each zi is M × 1, M = ro × co.
3: Calculate the mean of each class μi, the total mean of all data μ, the between-class matrix SB (M × M), and the within-class matrix SW (M × M) of X as in Algorithm 1 (Steps 2–5).
4: Find the k eigenvectors of SB with non-zero eigenvalues, and denote them as U = [u1, u2, ..., uk] (i.e. U^T SB U > 0).
5: Calculate the eigenvalues and eigenvectors of U^T SW U, then sort the eigenvalues and discard the eigenvectors which have high values. The selected eigenvectors are denoted by V; thus, V represents the null space of SW.
6: The final LDA matrix consists of the range17 of SB and the null space of SW, and is given by the product UV.
7: The original data are projected on the LDA space as follows: Y = XUV.

17 The range of a matrix A represents the column space of A (C(A)), where the dimension of C(A) is equal to the rank of A.
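A condensed numpy sketch in the spirit of Algorithm 4 is shown below; it is our own illustration (standard eigen-decompositions are used, no whitening or scaling steps are included, and the number of retained directions is a free parameter), not the authors' implementation:

    import numpy as np

    def direct_lda(X, y, k, keep):
        # Between-class (S_B) and within-class (S_W) scatter matrices.
        mu = X.mean(axis=0)
        M = X.shape[1]
        S_B = np.zeros((M, M))
        S_W = np.zeros((M, M))
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            d = (mu_c - mu).reshape(-1, 1)
            S_B += len(Xc) * (d @ d.T)
            S_W += (Xc - mu_c).T @ (Xc - mu_c)
        # Keep the k eigenvectors of S_B with the largest eigenvalues (its range).
        bvals, bvecs = np.linalg.eigh(S_B)
        U = bvecs[:, np.argsort(bvals)[::-1][:k]]        # (M, k)
        # Diagonalise U^T S_W U and keep the directions with the smallest
        # eigenvalues, i.e. those closest to the null space of S_W.
        wvals, wvecs = np.linalg.eigh(U.T @ S_W @ U)
        V = wvecs[:, np.argsort(wvals)[:keep]]           # (k, keep)
        Phi = U @ V                                      # final projection matrix
        return X @ Phi, Phi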
8. Conclusions

In this paper, the definition, mathematics, and implementation of LDA were presented and explained. The paper aimed to give low-level details on how the LDA technique can address the dimensionality reduction problem by extracting discriminative features based on maximizing the ratio between the between-class variance, SB, and the within-class variance, SW, thus discriminating between different classes. To achieve this aim, the paper followed the approach of not only explaining the steps of calculating SB and SW (i.e. the LDA space) but also visualizing these steps with figures and diagrams to make them easy to understand. Moreover, the two LDA methods, i.e. class-dependent and class-independent, were explained, and two numerical examples were given and graphically illustrated to explain how the LDA space can be constructed using the two methods. In all examples, the mathematical interpretation of the robustness and the selection of the eigenvectors, as well as the data projection, were detailed and discussed. Also, common LDA problems (e.g. the SSS and linearity problems) were mathematically explained using graphical examples, and then their state-of-the-art solutions were highlighted. Moreover, a detailed implementation of LDA applications was presented. Using three standard datasets, a number of experiments were conducted to (1) investigate and explain the relation between the number of eigenvectors and the robustness of the LDA space, and (2) practically show when the SSS problem occurs and how it can be addressed.

References

[1] S. Balakrishnama and A. Ganapathiraju, Linear discriminant analysis-a brief tutorial, Institute for Signal and Information Processing (1998).
[2] E. Barshan, A. Ghodsi, Z. Azimifar and M.Z. Jahromi, Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds, Pattern Recognition 44(7) (2011), 1357–1371. doi:10.1016/j.patcog.2010.12.015.
[3] G. Baudat and F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12(10) (2000), 2385–2404. doi:10.1162/089976600300014980.
[4] P.N. Belhumeur, J.P. Hespanha and D. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997), 711–720. doi:10.1109/34.598228.
[5] N.V. Boulgouris and Z.X. Chi, Gait recognition using Radon transform and linear discriminant analysis, IEEE Transactions on Image Processing 16(3) (2007), 731–740. doi:10.1109/TIP.2007.891157.
[6] M. Bramer, Principles of Data Mining, 2nd edn, Springer, 2013.
[7] M. Carreira-Perpinan, Compression neural networks for feature extraction: Application to human recognition from ear images, MS thesis, Faculty of Informatics, Technical University of Madrid, Spain, 1995.
[8] H.-P. Chan, D. Wei, M.A. Helvie, B. Sahiner, D.D. Adler, M.M. Goodsitt and N. Petrick, Computer-aided classification of mammographic masses and normal tissue: Linear discriminant analysis in texture feature space, Physics in Medicine and Biology 40(5) (1995), 857–876. doi:10.1088/0031-9155/40/5/010.
[9] L. Chen, N.C. Carpita, W.-D. Reiter, R.H. Wilson, C. Jeffries and M.C. McCann, A rapid method to screen for cell-wall mutants using discriminant analysis of Fourier transform infrared spectra, The Plant Journal 16(3) (1998), 385–392. doi:10.1046/j.1365-313x.1998.00301.x.
[10] L.-F. Chen, H.-Y.M. Liao, M.-T. Ko, J.-C. Lin and G.-J. Yu, A new lda-based face recognition system which can solve the small sample size problem, Pattern Recognition 33(10) (2000), 1713–1726. doi:10.1016/S0031-3203(99)00139-9.
[11] D. Coomans, D. Massart and L. Kaufman, Optimization by statistical linear discriminant analysis in analytical chemistry, Analytica Chimica Acta 112(2) (1979), 97–122. doi:10.1016/S0003-2670(01)83513-3.
[12] J. Cui and Y. Xu, Three dimensional palmprint recognition using linear discriminant analysis method, in: Proceedings of the Second International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA 2011), IEEE, 2011, pp. 107–111. doi:10.1109/IBICA.2011.31.
[13] D.-Q. Dai and P.C. Yuen, Face recognition by regularized discriminant analysis, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37(4) (2007), 1080–1085. doi:10.1109/TSMCB.2007.895363.
[14] D. Donoho and V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts?, in: Advances in Neural Information Processing Systems 16, MIT Press, 2004, pp. 1141–1148.
[15] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd edn, Wiley, 2012.
[16] S. Dudoit, J. Fridlyand and T.P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97(457) (2002), 77–87. doi:10.1198/016214502753479248.
[17] T.-t. Feng and G. Wu, A theoretical contribution to the fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices, 2014, arXiv preprint arXiv:1409.2579.
[18] J.H. Friedman, Regularized discriminant analysis, Journal of the American Statistical Association 84(405) (1989), 165–175. doi:10.1080/01621459.1989.10478752.
[19] T. Gaber, A. Tharwat, A.E. Hassanien and V. Snasel, Biometric cattle identification approach based on Weber's local descriptor and adaboost classifier, Computers and Electronics in Agriculture 122 (2016), 55–66. doi:10.1016/j.compag.2015.12.022.
[20] T. Gaber, A. Tharwat, A. Ibrahim, V. Snáel and A.E. Hassanien, Human thermal face recognition based on random linear oracle (rlo) ensembles, in: International Conference on Intelligent Networking and Collaborative Systems, IEEE, 2015, pp. 91–98. doi:10.1109/INCoS.2015.67.
[21] T. Gaber, A. Tharwat, V. Snasel and A.E. Hassanien, Plant identification: Two dimensional-based vs. one dimensional-based feature extraction methods, in: 10th International Conference on Soft Computing Models in Industrial and Environmental Applications, Springer, 2015, pp. 375–385.
[22] H. Gao and J.W. Davis, Why direct lda is not equivalent to lda, Pattern Recognition 39(5) (2006), 1002–1006. doi:10.1016/j.patcog.2005.11.016.
[23] G.H. Golub and C.F. Van Loan, Matrix Computations, 4th edn, Vol. 3, Johns Hopkins University Press, 2012.
[24] R. Haeb-Umbach and H. Ney, Linear discriminant analysis for improved large vocabulary continuous speech recognition, in: IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, IEEE, 1992, pp. 13–16.
[25] T. Hastie and R. Tibshirani, Discriminant analysis by Gaussian mixtures, Journal of the Royal Statistical Society, Series B (Methodological) (1996), 155–176.
[26] K. Héberger, E. Csomós and L. Simon-Sarkadi, Principal component and linear discriminant analyses of free amino acids and biogenic amines in Hungarian wines, Journal of Agricultural and Food Chemistry 51(27) (2003), 8055–8060. doi:10.1021/jf034851c.
[27] G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507. doi:10.1126/science.1127647.
[28] K. Honda and H. Ichihashi, Fuzzy local independent component analysis with external criteria and its application to knowledge discovery in databases, International Journal of Approximate Reasoning 42(3) (2006), 159–173. doi:10.1016/j.ijar.2005.10.011.
[29] Q. Hu, L. Zhang, D. Chen, W. Pedrycz and D. Yu, Gaussian kernel based fuzzy rough sets: Model, uncertainty measures and applications, International Journal of Approximate Reasoning 51(4) (2010), 453–471. doi:10.1016/j.ijar.2010.01.004.
[30] R. Huang, Q. Liu, H. Lu and S. Ma, Solving the small sample size problem of lda, in: Proceedings of 16th International Conference on Pattern Recognition, 2002, Vol. 3, IEEE, 2002, pp. 29–32.
[31] A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis, Vol. 46, Wiley, 2004.
[32] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, Wiley, 2000.
[33] D.T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, 1st edn, Wiley, 2014.
[34] T. Li, S. Zhu and M. Ogihara, Using discriminant analysis for multi-class classification: An experimental investigation, Knowledge and Information Systems 10(4) (2006), 453–472. doi:10.1007/s10115-006-0013-y.
[35] K. Liu, Y.-Q. Cheng, J.-Y. Yang and X. Liu, An efficient algorithm for foley–sammon optimal set of discriminant vectors by algebraic method, International Journal of Pattern Recognition and Artificial Intelligence 6(5) (1992), 817–829. doi:10.1142/S0218001492000412.
[36] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Face recognition using lda-based algorithms, IEEE Transactions on Neural Networks 14(1) (2003), 195–200. doi:10.1109/TNN.2002.806647.
[37] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularized discriminant analysis for the small sample size problem in face recognition, Pattern Recognition Letters 24(16) (2003), 3079–3087. doi:10.1016/S0167-8655(03)00167-3.
[38] J. Lu, K.N. Plataniotis and A.N. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition, Pattern Recognition Letters 26(2) (2005), 181–191. doi:10.1016/j.patrec.2004.09.014.
[39] B. Moghaddam, Y. Weiss and S. Avidan, Generalized spectral bounds for sparse lda, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 641–648.
[40] W. Müller, T. Nocke and H. Schumann, Enhancing the visualization process with principal component analysis to support the exploration of trends, in: Proceedings of the 2006 Asia-Pacific Symposium on Information Visualisation, Vol. 60, Australian Computer Society, Inc., 2006, pp. 121–130.
[41] S. Noushath, G.H. Kumar and P. Shivakumara, (2D)2 LDA: An efficient approach for face recognition, Pattern Recognition 39(7) (2006), 1396–1400. doi:10.1016/j.patcog.2006.01.018.
[42] K.K. Paliwal and A. Sharma, Improved pseudoinverse linear discriminant analysis method for dimensionality reduction, International Journal of Pattern Recognition and Artificial Intelligence 26(1) (2012), 1250002. doi:10.1142/S0218001412500024.
[43] F. Pan, G. Song, X. Gan and Q. Gu, Consistent feature selection and its application to face recognition, Journal of Intelligent Information Systems 43(2) (2014), 307–321. doi:10.1007/s10844-014-0324-5.
[44] C.H. Park and H. Park, Fingerprint classification using fast Fourier transform and nonlinear discriminant analysis, Pattern Recognition 38(4) (2005), 495–503. doi:10.1016/j.patcog.2004.08.013.
[45] C.H. Park and H. Park, A comparison of generalized linear discriminant analysis algorithms, Pattern Recognition 41(3) (2008), 1083–1097. doi:10.1016/j.patcog.2007.07.022.
[46] S. Rezzi, D.E. Axelson, K. Héberger, F. Reniero, C. Mariani and C. Guillou, Classification of olive oils using high throughput flow 1H NMR fingerprinting with principal component analysis, linear discriminant analysis and probabilistic neural networks, Analytica Chimica Acta 552(1) (2005), 13–24. doi:10.1016/j.aca.2005.07.057.
[47] Y. Saeys, I. Inza and P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23(19) (2007), 2507–2517. doi:10.1093/bioinformatics/btm344.
[48] F.S. Samaria and A.C. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, IEEE, 1994, pp. 138–142.
[49] B. Schölkopf, C.J. Burges and A.J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999.
[50] B. Schölkopf and K.-R. Müller, Fisher discriminant analysis with kernels, in: Proceedings of the 1999 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing IX, Madison, WI, USA, 1999, pp. 41–48.
[51] B. Schölkopf, A. Smola and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10(5) (1998), 1299–1319. doi:10.1162/089976698300017467.
[52] A. Sharma, S. Imoto and S. Miyano, A between-class overlapping filter-based method for transcriptome data analysis, Journal of Bioinformatics and Computational Biology 10(5) (2012), 1250010. doi:10.1142/S0219720012500102.
[53] A. Sharma, S. Imoto and S. Miyano, A filter based feature selection algorithm using null space of covariance matrix for dna microarray gene expression data, Current Bioinformatics 7(3) (2012), 289–294. doi:10.2174/157489312802460802.
[54] A. Sharma, S. Imoto, S. Miyano and V. Sharma, Null space based feature selection method for gene expression data, International Journal of Machine Learning and Cybernetics 3(4) (2012), 269–276. doi:10.1007/s13042-011-0061-9.
[55] A. Sharma and K.K. Paliwal, Cancer classification by gradient lda technique using microarray gene expression data, Data & Knowledge Engineering 66(2) (2008), 338–347. doi:10.1016/j.datak.2008.04.004.
[56] A. Sharma and K.K. Paliwal, A new perspective to null linear discriminant analysis method and its fast implementation using random matrix multiplication with scatter matrices, Pattern Recognition 45(6) (2012), 2205–2213. doi:10.1016/j.patcog.2011.11.018.
[57] A. Sharma and K.K. Paliwal, Linear discriminant analysis for the small sample size problem: An overview, International Journal of Machine Learning and Cybernetics (2014), 1–12.
[58] A.J. Smola and B. Schölkopf, A tutorial on support vector regression, Statistics and Computing 14(3) (2004), 199–222. doi:10.1023/B:STCO.0000035301.49549.88.
[59] G. Strang, Introduction to Linear Algebra, 4th edn, Wellesley-Cambridge Press, Massachusetts, 2003.
[60] D.L. Swets and J.J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8) (1996), 831–836. doi:10.1109/34.531802.
[61] M.M. Tantawi, K. Revett, A. Salem and M.F. Tolba, Fiducial feature reduction analysis for electrocardiogram (ecg) based biometric recognition, Journal of Intelligent Information Systems 40(1) (2013), 17–39. doi:10.1007/s10844-012-0214-7.
[62] A. Tharwat, Principal component analysis-a tutorial, International Journal of Applied Pattern Recognition 3(3) (2016), 197–240. doi:10.1504/IJAPR.2016.079733.
[63] A. Tharwat, T. Gaber, Y.M. Awad, N. Dey and A.E. Hassanien, Plants identification using feature fusion technique and bagging classifier, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 461–471.
[64] A. Tharwat, T. Gaber and A.E. Hassanien, One-dimensional vs. two-dimensional based features: Plant identification approach, Journal of Applied Logic (2016).
[65] A. Tharwat, T. Gaber, A.E. Hassanien, H.A. Hassanien and M.F. Tolba, Cattle identification using muzzle print images based on texture features approach, in: Proceedings of the Fifth International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA 2014), Springer, 2014, pp. 217–227.
[66] A. Tharwat, A.E. Hassanien and B.E. Elnaghi, A ba-based algorithm for parameter optimization of support vector machine, Pattern Recognition Letters (2016).
[67] A. Tharwat, A. Ibrahim, A.E. Hassanien and G. Schaefer, Ear recognition using block-based principal component analysis and decision fusion, in: International Conference on Pattern Recognition and Machine Intelligence, Springer, 2015, pp. 246–254. doi:10.1007/978-3-319-19941-2_24.
[68] A. Tharwat, H. Mahdi, A. El Hennawy and A.E. Hassanien, Face sketch synthesis and recognition based on linear regression transformation and multi-classifier technique, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), Beni Suef, Egypt, November 28–30, 2015, Springer, 2016, pp. 183–193.
[69] A. Tharwat, Y.S. Moemen and A.E. Hassanien, Classification of toxicity effects of biotransformed hepatic drugs using whale optimized support vector machines, Journal of Biomedical Informatics (2017).
[70] C.G. Thomas, R.A. Harshman and R.S. Menon, Noise reduction in bold-based fmri using component analysis, Neuroimage 17(3) (2002), 1521–1537. doi:10.1006/nimg.2002.1200.
[71] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3(1) (1991), 71–86. doi:10.1162/jocn.1991.3.1.71.
[72] V. Vapnik, The Nature of Statistical Learning Theory, 2nd edn, Springer, New York, 2013.
[73] J. Venna, J. Peltonen, K. Nybo, H. Aidos and S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research 11 (2010), 451–490.
[74] P. Viszlay, M. Lojka and J. Juhár, Class-dependent two-dimensional linear discriminant analysis using two-pass recognition strategy, in: Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), IEEE, 2014, pp. 1796–1800.
[75] X. Wang and X. Tang, Random sampling lda for face recognition, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, IEEE, 2004, pp. II–II.
[76] M. Welling, Fisher Linear Discriminant Analysis, Vol. 3, Department of Computer Science, University of Toronto, 2005.
[77] M.C. Wu, L. Zhang, Z. Wang, D.C. Christiani and X. Lin, Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection, Bioinformatics 25(9) (2009), 1145–1151. doi:10.1093/bioinformatics/btp019.
[78] J. Yang, D. Zhang, A.F. Frangi and J.-y. Yang, Two-dimensional pca: A new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1) (2004), 131–137. doi:10.1109/TPAMI.2004.1261097.
[79] L. Yang, W. Gong, X. Gu, W. Li and Y. Liang, Null space discriminant locality preserving projections for face recognition, Neurocomputing 71(16) (2008), 3644–3649. doi:10.1016/j.neucom.2008.03.009.
[80] W. Yang and H. Wu, Regularized complete linear discriminant analysis, Neurocomputing 137 (2014), 185–191. doi:10.1016/j.neucom.2013.08.048.
[81] J. Ye, R. Janardan and Q. Li, Two-dimensional linear discriminant analysis, in: Proceedings of 17th Advances in Neural Information Processing Systems (NIPS), 2004, pp. 1569–1576.
[82] J. Ye and T. Xiong, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, The Journal of Machine Learning Research 7 (2006), 1183–1204.
[83] H. Yu and J. Yang, A direct lda algorithm for high-dimensional data with application to face recognition, Pattern Recognition 34(10) (2001), 2067–2070. doi:10.1016/S0031-3203(00)00162-X.
[84] L. Yuan and Z.-c. Mu, Ear recognition based on 2d images, in: Proceedings of the First IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2007), IEEE, 2007, pp. 1–5.
[85] X.-S. Zhuang and D.-Q. Dai, Inverse Fisher discriminate criteria for small sample size problem and its application to face recognition, Pattern Recognition 38(11) (2005), 2192–2194. doi:10.1016/j.patcog.2005.02.011.
