Feature Selection Using Genetic Algorithms
SJSU ScholarWorks
Master's Projects Master's Theses and Graduate Research
Spring 2018
Recommended Citation
Kannan, Vandana, "Feature Selection using Genetic Algorithms" (2018). Master's Projects. 618.
DOI: https://doi.org/10.31979/etd.6mq4-cp5p
https://scholarworks.sjsu.edu/etd_projects/618
This Master's Project is brought to you for free and open access by the Master's Theses and Graduate Research at SJSU ScholarWorks. It has been
accepted for inclusion in Master's Projects by an authorized administrator of SJSU ScholarWorks. For more information, please contact
[email protected].
Feature Selection using Genetic Algorithms
A project
Presented to
In Partial Fulfilment
Master of Science
by
Vandana Kannan
May 2018
©2018
Vandana Kannan
With the large amount of data of different types available today, the number of features
that can be extracted from it is huge. The ever-increasing popularity of multimedia applications
has been a major factor in this, especially in the case of image data. Image data is used for several
applications such as classification, retrieval, object recognition, and annotation. Often, utilizing
the entire feature set for each of these activities can not only be time consuming but can also
Given the large number of features, it is difficult to find the subset of features that is useful for a
given task. Genetic Algorithms (GA) can be used to alleviate this problem by searching the entire
feature set for those features that are not only essential but improve performance as well. In this
project, we explore the various approaches to using GA to select features for different applications,
and develop a solution that uses a reduced feature set (selected by GA) to classify images based
on their domain/genre.
The increased interest in Machine Learning applications has led to the design and development of
Random Forest (RF), Support Vector Machine (SVM), and Neural Networks (NN), and perform
10-fold cross-validation with all 3 methods. The idea is to evaluate the performance of each
classifier with the reduced feature set and analyze the impact of feature selection on the accuracy
of the model. It is observed that the RF is insensitive to feature selection, while SVM and NN
The use of this solution is demonstrated in image retrieval, and a possible application in image
ACKNOWLEDGEMENTS
I would like to thank my project advisor Dr. Sami Khuri, for his continuous support and
committee members, Dr. Katerina Potika and Mr. Kevin Smith for their time and support.
Special thanks to Dr. Natalia Khuri for her valuable inputs and guidance throughout the
TABLE OF CONTENTS
CHAPTER 1
Introduction
CHAPTER 2
CHAPTER 3
CHAPTER 4
CHAPTER 5
Classification
CHAPTER 6
6.2 Dataset
6.4 Results
CHAPTER 7
REFERENCES
APPENDIX
LIST OF FIGURES
Figure 5 3×3×5 histogram in the HSV color space of the sample image in Figure 4
Figure 17 Cross-validation
Figure 19 (a) Linear model (b) Hierarchical model (c) Non-linear model
Figure 23 Application of the proposed classifier in theme-based image retrieval
Figure 25 Art vs Photo: 10-fold ROC curve for RF with Reduced and Full Feature set
Figure 26 Art vs Photo: 10-fold ROC curve for SVM with Reduced and Full Feature set
Figure 27 Art vs Photo: 10-fold ROC curve for NN with Reduced and Full Feature set
Figure 28 Cartoon vs Photo: 10-fold ROC curve for RF with Reduced and Full Feature set
Figure 29 Cartoon vs Photo: 10-fold ROC curve for SVM with Reduced and Full Feature set
Figure 30 Cartoon vs Photo: 10-fold ROC curve for NN with Reduced and Full Feature set
Figure 31 (a) Accuracy comparison: Art vs Photo (b) Accuracy comparison: Cartoon vs Photo
Figure 45 Test Image 14 [44]
LIST OF TABLES
TABLE VII Performance of the classifiers with the reduced feature set selected by GA
CHAPTER 1
Introduction
A feature is a property or attribute of data that algorithms, such as those in the field
of machine learning, can use to obtain useful information from datasets. Every datum in an application has
some features. For a given application, all the features that are extracted, or a subset of them, are used to
With the increase in data available at our disposal, plus tens to hundreds of features available for
different datasets, the complexity of the system increases not only in terms of understanding data,
but in terms of resource utilization and system performance. While the size of the dataset cannot
be controlled, the feature set can be reduced to include only relevant and unique features so that
the overall performance increases and resource utilization decreases [1]. Redundant or irrelevant
features may take the form of correlated features, between which there is a dependency. The
dependent features may not provide any extra information or have an impact on the output. This
means that eliminating such a feature does not affect the total information content. In some cases,
such features may introduce a bias in the system and thus affect the performance. Given that there
may be N features possible for a dataset, there may be $2^N$ combinations of features to test to find
out which features contribute positively to the outcome of the problem. Evolutionary algorithms
such as Genetic Algorithms (GA) can be used for feature selection, where a subset of features
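To give a sense of how quickly the number of candidate subsets grows with N, the following Python sketch counts the subsets for a few values of N and enumerates them for a tiny feature list. The feature names and the values of N are made up for illustration and are not taken from the project.

```python
from itertools import combinations


def count_subsets(n_features: int) -> int:
    """Number of distinct feature subsets, including the empty subset."""
    return 2 ** n_features


def all_subsets(features):
    """Yield every subset of a feature list; only feasible for very small N."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset


# Hypothetical feature names, used only to show the enumeration.
toy_features = ["mean_h", "mean_s", "mean_v"]
subsets = list(all_subsets(toy_features))
print(len(subsets), "subsets for", len(toy_features), "features")

for n in (10, 50, 100):
    print(n, "features ->", count_subsets(n), "subsets")
```

Exhaustive enumeration is only practical for a handful of features, which is why a guided search is needed for larger feature sets.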
In the smartphone era, apps for capturing or sharing multimedia content have gained
popularity. Given the mass multimedia sharing that takes place on the Internet, it is no surprise
that there are large troves of image/video/audio data readily available for use. In particular, images
have been used in various applications such as classification, retrieval, object recognition, and
annotation. The fact that images are complex data is evident from the number of features that can be
extracted from an image to represent it. The features range from the basic pixel colors to the more
complex texture and contour features. It therefore becomes important to make use of feature
selection techniques to select only the necessary features for a given application.
Through this work, we investigate the downside of considering a huge number of features by
implementing a GA-based feature selection solution and utilizing it in an application to
classify images based on their genre/domain. Images generally belong to 4 domains: photographs,
paintings, cartoons, and sketches [2]. Identifying an image’s genre not only gives the user an idea
about the type of the image, but also finds applications in digital forensics, spam analysis, image
The aim of this work is to analyze existing work that uses evolutionary algorithms for feature
selection, propose a new GA-based solution for feature subset selection, and apply the proposed
solution to classify photographs, cartoons, and paintings. The motivation behind selecting these
specific classifications is to enable theme-based image retrieval and image tampering detection.
Chapter 2 explains the various features that have been extracted with the help of a sample image
Chapter 3 introduces GA and the parameters set for the experiments in this
project,
Chapter 4 investigates the methods of feature selection and previous work on using GA for
feature selection,
Chapter 5 introduces the concept of classification as a form of supervised learning and explains 3
Chapter 6 highlights possible applications of the proposed solution, references the dataset used
CHAPTER 2
Feature extraction is the process of parsing input data, in the form of text, image or audio, to find
out characteristics that can uniquely represent the data. For example, for audio data, possible
features could be sampling rate, pitch, amplitude, duration, etc. Similarly, for image data
considered in this work, some examples of features would be mean color, aspect ratio, etc.
Color models are mathematical models used to represent colors of an image. This representation
is generally a tuple of 3 to 4 values and is independent of devices. Some examples of color models
are RGB, CMY, etc. Color spaces on the other hand, represent the colors that can be visualized,
ii) User-based models: Based on human perception. Represents Hue, Saturation, and
For the solution proposed in this work, only Red, Green, Blue (RGB) and Hue, Saturation, Value
RGB model: This is an additive model that is used by TV monitors and computer screens. Red,
Green, and Blue color beams are summed up at the projection screen. All colors that appear on the
screen are a summation of R, G, B. To specify each color, the chromaticity values of each of the 3
primary colors need to be specified. R corresponds to the 700nm band of the spectrum, G
corresponds to the 546nm band, and B corresponds to the 435nm band. The RGB model is
HSV model: While the RGB model is convenient to specify colors in terms of language that
hardware/devices would understand, it is difficult for humans to speak in terms of RGB. Humans
are more naturally inclined to specify colors in terms of hue, saturation, and intensity. HSI models
cater to this need. They are used in computer graphics to specify tints, shades, and tones. Unlike the
RGB model, HSI models have cylindrical coordinates. The HSV model (Figure 2 [4]) belongs to
A point in the RGB coordinate space can be transformed to a point in the HSV coordinate space
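\[ V = \frac{\max(R, G, B)}{255} \]

\[ S = \frac{\max(R, G, B) - \min(R, G, B)}{\max(R, G, B)} \]

\[ H_1 = \cos^{-1}\left\{ \frac{\frac{1}{2}\left[(R - G) + (R - B)\right]}{\sqrt{(R - G)^2 + (R - B)(G - B)}} \right\} \]

\[ H = \begin{cases} H_1, & B \le G \\ 360^\circ - H_1, & B > G \end{cases} \]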
The HSV model is used in this project as it aligns with the human representation of color.
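The RGB-to-HSV transformation given above can be sketched in a few lines of Python. This is a minimal illustration that follows the equations as written; the function name `rgb_to_hsv`, the handling of black pixels (S = 0), and the convention of returning H = 0 for gray pixels, where the hue is undefined, are assumptions. Library routines often use a piecewise hue formula instead, so their hue values may differ slightly for some colors.

```python
import math


def rgb_to_hsv(r: int, g: int, b: int):
    """Convert 8-bit RGB to (H in degrees, S in [0, 1], V in [0, 1])."""
    mx, mn = max(r, g, b), min(r, g, b)
    v = mx / 255.0
    # Assumed convention: saturation of pure black is 0.
    s = 0.0 if mx == 0 else (mx - mn) / mx
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        # Gray pixel (R = G = B): hue is undefined; 0 is an assumed convention.
        return 0.0, s, v
    ratio = max(-1.0, min(1.0, num / den))  # guard against rounding error
    h1 = math.degrees(math.acos(ratio))
    h = h1 if b <= g else 360.0 - h1
    return h, s, v


for name, rgb in [("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    print(name, rgb_to_hsv(*rgb))
```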
Image features refer to the information collected from images that can uniquely identify the image
or can be used for further processing. Broadly, image features can be classified into general
features and domain-specific features [5]. General features, such as color and texture are applicable
to all image data and do not depend on the application being considered. Domain-specific features
on the other hand, are specific to the application at hand, such as, minutiae in fingerprints. In this
work, general features are explored and used in different applications that require image
classification.
Based on the locality of features, image features can be categorized into [6]:
(i) Local features: Local features are the patterns in images that differ from their immediate
neighborhood. These features are extracted from a patch in the image and are useful in
applications such as object recognition. Some examples of local features are
Scale-Invariant Feature Transform (SIFT), Local Binary Pattern (LBP), and Speeded Up
(ii) Global features: Global features represent the whole image. These features are
extracted considering the whole image as one patch/object and are useful in
segmentation of objects is available. Some examples of global features are Histogram
[Figure 3: Features of an image, grouped as Color (moments: mean, standard deviation, skewness, kurtosis; histogram), Texture (Gray Level Co-occurrence Matrix: contrast, correlation, homogeneity, energy; Gabor filter: mean, variance; Local Binary Pattern), and Shape (Histogram of Oriented Gradients, granulometry, bilateral filtering)]
Based on the visual content of the image, features of images can be categorized into (Figure 3):
(i) Color features: Color is the most commonly used image feature that can be recognized
(ii) Texture features: Texture provides information about the color and intensity of the
(iii) Shape features: Shapes are yet another feature that can be detected by humans. They
represent the contour or outline of objects in an image. Ideally, the scale, orientation,
and position of objects must not affect the features that are extracted based on shape.
TABLE I Summary of features extracted
Feature Count
Mean of HSV 3
Skewness of HSV 3
Kurtosis of HSV 3
GLCM - contrast 4
GLCM - correlation 4
GLCM - homogeneity 4
Texture
GLCM - Energy 4
Granulometry 20
984
Figure 4 is a sample image from the category ‘Cartoons’ of the PACS dataset [2] that will be used
Figure 4 Sample image to demonstrate feature extraction
1. Color moments: Color moments are analogous to central moments and are used to
characterize the distribution of colors in an image. They are used to compare the similarity
between images. The lower the difference between the color moments of two images, the
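\[ \text{Mean}_H = \frac{1}{N}\sum_{j=1}^{N} p_{Hj} \]

\[ \text{Mean}_S = \frac{1}{N}\sum_{j=1}^{N} p_{Sj} \]

\[ \text{Mean}_V = \frac{1}{N}\sum_{j=1}^{N} p_{Vj} \]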
2nd moment – Standard Deviation: It is the square root of the variance which is a measure
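\[ \sigma_H = \sqrt{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Hj} - \text{Mean}_H\right)^2} \]

\[ \sigma_S = \sqrt{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Sj} - \text{Mean}_S\right)^2} \]

\[ \sigma_V = \sqrt{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Vj} - \text{Mean}_V\right)^2} \]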
3rd moment – Skewness: It gives a measure of the shape of the color distribution [8].
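\[ s_H = \sqrt[3]{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Hj} - \text{Mean}_H\right)^3} \]

\[ s_S = \sqrt[3]{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Sj} - \text{Mean}_S\right)^3} \]

\[ s_V = \sqrt[3]{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Vj} - \text{Mean}_V\right)^3} \]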
4th moment – Kurtosis: It gives a measure of the shape of the distribution in terms of height.
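\[ k_H = \sqrt[4]{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Hj} - \text{Mean}_H\right)^4} \]

\[ k_S = \sqrt[4]{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Sj} - \text{Mean}_S\right)^4} \]

\[ k_V = \sqrt[4]{\frac{1}{N}\sum_{j=1}^{N} \left(p_{Vj} - \text{Mean}_V\right)^4} \]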
Since each moment is computed for each of the 3 channels – H, S, V, there are a total of
H S V
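The four color moments defined above can be computed per channel with a short Python sketch. The helper name `color_moments` and the toy pixel values are hypothetical; the signed cube root used for skewness is an assumption made so that negative third moments remain real-valued.

```python
import math


def color_moments(values):
    """Return (mean, standard deviation, skewness, kurtosis) of one channel."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((p - mean) ** 2 for p in values) / n
    m3 = sum((p - mean) ** 3 for p in values) / n
    m4 = sum((p - mean) ** 4 for p in values) / n
    std = math.sqrt(m2)
    # Signed cube root keeps the direction of the asymmetry (assumption).
    skew = math.copysign(abs(m3) ** (1.0 / 3.0), m3)
    kurt = m4 ** 0.25
    return mean, std, skew, kurt


# Hypothetical 2x2 image, each channel flattened in row-major order.
hsv_pixels = {
    "H": [10, 20, 30, 40],
    "S": [100, 100, 100, 100],
    "V": [0, 0, 0, 255],
}
features = []
for channel in ("H", "S", "V"):
    features.extend(color_moments(hsv_pixels[channel]))
print(len(features), "color-moment values:", features)
```

Each channel contributes four values, so the sketch yields a flat vector that can be concatenated with the other feature groups.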
2. Color histogram: A color histogram represents the distribution of colors in an image. It can
be visualized either as a distribution of each channel or as a bar chart depicting the number
of pixels of a color/channel. Figure 5 represents the 3×3×5 histogram of the image in Figure
4. This set of histograms is generated by creating 3 bins for H channel consisting of 180
values, 3 bins for S channel consisting of 256 values, and 5 bins for V channel consisting
Figure 5 3×3×5 histogram in the HSV color space of the sample image in Figure 4
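As an illustration of the binning described above, the following Python sketch builds a 3×3×5 histogram from HSV pixels, with H in [0, 180) and S and V in [0, 256). It assumes a joint histogram with 45 bins; the project may instead have computed separate per-channel histograms. The function name `hsv_histogram` and the sample pixels are hypothetical.

```python
def hsv_histogram(pixels, bins=(3, 3, 5), ranges=(180, 256, 256)):
    """Joint H-S-V histogram as a flat list of bins[0]*bins[1]*bins[2] counts."""
    hb, sb, vb = bins
    hist = [0] * (hb * sb * vb)
    for h, s, v in pixels:
        # Map each channel value to its bin index by integer scaling.
        hi = h * hb // ranges[0]
        si = s * sb // ranges[1]
        vi = v * vb // ranges[2]
        hist[(hi * sb + si) * vb + vi] += 1
    return hist


# Hypothetical pixels given as (H, S, V) tuples.
sample_pixels = [(0, 0, 0), (179, 255, 255), (90, 128, 128), (0, 0, 10)]
hist = hsv_histogram(sample_pixels)
print(len(hist), "bins, non-empty:", [(i, n) for i, n in enumerate(hist) if n])
```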
1. Gray Level Co-Occurrence Matrix (GLCM): GLCM was introduced by Haralick for
classifying rocks into 6 categories [9]. GLCM is an N×N matrix that is computed for a gray
scale image containing N gray levels. The element (i,j) in the GLCM indicates the number
in the horizontal, vertical, left diagonal, and right diagonal directions. In terms of angles,
these directions translate to 0°, 45°, 90°, and 135° (Figure 6 [10]). 14 statistical features
Figure 6 GLCM adjacency directions
Out of the 14, four features have been extracted for this project.
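\[ \text{Contrast} = \sum_{i,j=0}^{\text{levels}-1} p_{i,j}\,(i - j)^2 \]

\[ \text{Homogeneity} = \sum_{i,j=0}^{\text{levels}-1} \frac{p_{i,j}}{1 + (i - j)^2} \]

\[ \text{Energy} = \sum_{i,j=0}^{\text{levels}-1} p_{i,j}^2 \]

\[ \text{Correlation} = \sum_{i,j=0}^{\text{levels}-1} p_{i,j}\,\frac{(i - \mu_i)(j - \mu_j)}{\sqrt{(\sigma_i^2)(\sigma_j^2)}} \]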
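The four Haralick features above can be sketched for a single direction as follows. This minimal Python example builds a normalized, non-symmetric co-occurrence matrix for the horizontal (0°) offset on a small made-up image with 4 gray levels; the function names and the image are assumptions for illustration. Note that some libraries report the square root of the sum of squares as "energy".

```python
import math


def glcm(image, levels, dr=0, dc=1):
    """Normalized co-occurrence matrix for the pixel offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]


def haralick_features(p):
    """Contrast, homogeneity, energy, and correlation of a normalized GLCM."""
    idx = range(len(p))
    contrast = sum(p[i][j] * (i - j) ** 2 for i in idx for j in idx)
    homogeneity = sum(p[i][j] / (1 + (i - j) ** 2) for i in idx for j in idx)
    energy = sum(p[i][j] ** 2 for i in idx for j in idx)
    mu_i = sum(i * p[i][j] for i in idx for j in idx)
    mu_j = sum(j * p[i][j] for i in idx for j in idx)
    var_i = sum(p[i][j] * (i - mu_i) ** 2 for i in idx for j in idx)
    var_j = sum(p[i][j] * (j - mu_j) ** 2 for i in idx for j in idx)
    denom = math.sqrt(var_i * var_j)
    # Correlation is undefined for a constant image; 0 is an assumed fallback.
    correlation = 0.0 if denom == 0 else sum(
        p[i][j] * (i - mu_i) * (j - mu_j) for i in idx for j in idx) / denom
    return contrast, homogeneity, energy, correlation


# Hypothetical 4x4 image with gray levels 0..3.
toy_image = [[0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 2, 2, 2],
             [2, 2, 3, 3]]
p = glcm(toy_image, levels=4)
print(haralick_features(p))
```

Repeating the computation with the offsets for the other three directions gives four values per feature.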
TABLE III Haralick features of the sample image in Figure 4
2. Local Binary Pattern (LBP): While GLCM computes global texture features, LBP
computes local texture features. Like GLCM, this technique is applied on gray scale
images. Considering each pixel as the center, an LBP value is computed and stored in an
array that is the same size as the original image. For each center pixel, a radius r is set and
n points are sampled. If the neighbor selected in the n points has an intensity
value less than the center, then it is set to 0, otherwise it is set to 1. Considering all the ones
and zeros in a consistent order, the binary number thus formed is converted to decimal.
This value then becomes the LBP value of the center pixel. This process is illustrated in
Figure 7.
The final step is to compute the histogram of LBP values (Figure 8). For the sample image
Figure 8 LBP histogram of the sample image in Figure 4
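The LBP computation described above can be sketched for the simplest case of 8 neighbors at radius 1. The clockwise visiting order starting at the top-left corner, the 256-bin histogram, and the function names are assumptions for illustration; the thresholding rule (0 if the neighbor is less than the center, 1 otherwise) follows the text.

```python
def lbp_value(patch):
    """LBP code of the center pixel of a 3x3 patch (8 neighbors, radius 1)."""
    center = patch[1][1]
    # Neighbors are visited clockwise, starting at the top-left corner.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for r, c in order:
        bit = 0 if patch[r][c] < center else 1
        code = (code << 1) | bit  # first neighbor ends up as the highest bit
    return code


def lbp_histogram(image):
    """Histogram (256 bins) of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_value(patch)] += 1
    return hist


# Hypothetical 3x3 gray-scale patch.
toy_patch = [[5, 9, 1],
             [4, 6, 7],
             [2, 3, 8]]
print(lbp_value(toy_patch))
```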
3. Gabor filter: Gabor filters are bandpass filters that are used to extract texture from images.
A Gabor kernel of certain size is passed over the image such that it detects edges or textures
of given frequency and orientation. A Gabor filter bank is constructed for various
combinations of frequency and orientation. The key parameters of this filter are kernel size,
sigma (the standard deviation of the Gaussian function), theta (orientation of the normal), and
lambda (wavelength of the sinusoidal function). Statistical features such as moments of the
distribution, median, and entropy can be extracted from the output of the convolution of
the image with the Gabor filter. Figure 9 demonstrates the application of a Gabor filter on
Figure 9 Gabor filter applied on the sample image in Figure 4
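A Gabor kernel and the mean/variance features of its response can be sketched as follows. This is a minimal illustration using the common real-valued Gabor formula; the aspect ratio `gamma`, the phase offset `psi`, the parameter values, and the small 8-kernel bank are assumptions and do not reproduce the filter bank used in the project.

```python
import math


def gabor_kernel(ksize, sigma, theta, lambd, gamma=1.0, psi=0.0):
    """Real part of a Gabor kernel as a ksize x ksize list of lists."""
    half = ksize // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(image, kernel):
    """'Valid' 2-D correlation of a gray-scale image with the kernel."""
    k = len(kernel)
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]


def mean_and_variance(values):
    """Mean and variance of a 2-D list of filter responses."""
    flat = [v for row in values for v in row]
    m = sum(flat) / len(flat)
    return m, sum((v - m) ** 2 for v in flat) / len(flat)


# Hypothetical bank: 4 orientations x 2 wavelengths, 9x9 kernels.
bank = [gabor_kernel(9, 3.0, k * math.pi / 4, lam)
        for k in range(4) for lam in (4.0, 8.0)]
toy_image = [[float((r * c) % 7) for c in range(12)] for r in range(12)]
features = [mean_and_variance(filter_response(toy_image, g)) for g in bank]
print(len(features), "(mean, variance) pairs")
```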
For this project, 32 9×9 Gabor filters were created, giving 32 mean values and 32 variance
TABLE IV (Mean, variance) extracted for each Gabor kernel [axes: orientation (in radians), frequency (in pixels)]
2.2.3 Shape features
1. Histogram of Oriented Gradients (HOG): HOG is a feature used to detect objects in images.
It counts the number of times a gradient orientation occurs in a patch of the image [11].
Plotting the HOG image roughly highlights the outline of the object in the image. The idea
behind this feature is that the shape of an object can be represented by its edge direction.
The image is split into cells (2×2 patches in this project). Each cell contains a block of
pixels (16×16 pixels per cell in this project). The horizontal and vertical gradients of
the image are then calculated using the Sobel operator. X-gradient highlights vertical edges
and Y-gradient highlights horizontal edges. The HOG is computed for every cell. Although
consists of 9 bins representing the angles 0, 20, …, 160. Figure 10 shows an example of
how HOG is computed. The HOG computed for each cell is summed up to produce the
HOG for the image. Figure 11 shows the HOG image for Figure 4.
Figure 11 HOG image of the sample image in Figure 4
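The per-cell orientation histogram can be sketched in Python as below. This is a simplified illustration: gradients are computed with the Sobel operator on the interior pixels of a single cell, each pixel votes with its gradient magnitude into one of 9 unsigned-orientation bins (0°, 20°, ..., 160°) without interpolation between bins, and no block normalization is applied. These simplifications and the function name `cell_hog` are assumptions.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def cell_hog(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient angles for one cell."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins  # 20 degrees for 9 bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = sum(SOBEL_X[i][j] * cell[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * cell[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# Hypothetical 4x4 cells: one with a vertical edge, one with a horizontal edge.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
horizontal_edge = [[0] * 4, [0] * 4, [10] * 4, [10] * 4]
print(cell_hog(vertical_edge))
print(cell_hog(horizontal_edge))
```

The vertical edge produces a purely horizontal gradient and therefore fills the 0° bin, while the horizontal edge fills the bin containing 90°.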
2. Bilateral filtering: A bilateral filter, when applied on an image, reduces noise and enhances
the edges in the image. This is done by replacing a pixel’s intensity with a weighted average
of the intensities of its neighbors. While a Gaussian filter does the same, a bilateral
filter additionally ensures that the edges are preserved [12]. A bilateral filter is useful in differentiating
cartoons, which have prominent edges, from photographs, which do not have well-defined edges.
A feature that can be extracted from this filter is the mean difference between the original
image and the image with bilateral filter applied. The idea is that images like cartoons will
have minimal difference while photographs will have a large mean difference [13].
Figure 12 Bilateral filtering applied to the sample image in Figure 4
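The mean-difference feature derived from bilateral filtering can be sketched on a small gray-scale image. The following Python example is a naive implementation for illustration only; the window radius, the two sigma values, and the toy images are assumptions rather than the settings used in the project.

```python
import math


def bilateral_filter(image, radius=1, sigma_s=1.0, sigma_r=25.0):
    """Naive bilateral filter on a gray-scale image given as a list of lists."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = weight_sum = 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue  # ignore neighbors outside the image
                    # Spatial weight: closer neighbors count more.
                    spatial = math.exp(-(dr * dr + dc * dc) / (2 * sigma_s ** 2))
                    # Range weight: similar intensities count more (keeps edges).
                    diff = image[rr][cc] - image[r][c]
                    similarity = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                    w = spatial * similarity
                    total += w * image[rr][cc]
                    weight_sum += w
            out[r][c] = total / weight_sum
    return out


def mean_abs_difference(a, b):
    """Mean absolute pixel difference between two equally sized images."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# Hypothetical images: a clean step edge and a fine checkerboard texture.
step_edge = [[0, 0, 200, 200] for _ in range(4)]
checker = [[100 + 10 * ((r + c) % 2) for c in range(4)] for r in range(4)]
print(mean_abs_difference(step_edge, bilateral_filter(step_edge)))
print(mean_abs_difference(checker, bilateral_filter(checker)))
```

The step edge is left almost unchanged, while the low-contrast texture is smoothed, which is the behavior the mean-difference feature relies on.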
an image to try and capture similar shapes that may be present in the image. Typical shapes
for an SE are square, rectangle, and cross. The size of the SE is varied to find objects of a
shape based on size. In this project, a disk SE is used as cartoons are more likely to have
curved/soft edged objects [4]. Morphological opening is used along with a disk SE in this
project. Morphological opening is an operation used to remove noise from images and find
specific shapes. Morphological opening is repeatedly applied on the image with different
recorded. Figure 13 represents the granulometry distribution of the sample image in Figure
4. From the distribution, we can learn that cartoons have a smaller number of small-sized
Figure 13 Particle size distribution of the sample image in Figure 4
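Granulometry by repeated morphological opening with a disk SE can be sketched on a small binary image. In this minimal Python example, the remaining foreground area is recorded after opening with disks of increasing radius; the binary input, the treatment of pixels outside the image as background, and the function names are assumptions for illustration.

```python
def disk_offsets(radius):
    """Pixel offsets (dy, dx) that make up a disk structuring element."""
    return [(dy, dx) for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def _get(img, y, x):
    """Pixel value, treating everything outside the image as background."""
    return img[y][x] if 0 <= y < len(img) and 0 <= x < len(img[0]) else 0


def erode(img, offsets):
    return [[int(all(_get(img, y + dy, x + dx) for dy, dx in offsets))
             for x in range(len(img[0]))] for y in range(len(img))]


def dilate(img, offsets):
    return [[int(any(_get(img, y + dy, x + dx) for dy, dx in offsets))
             for x in range(len(img[0]))] for y in range(len(img))]


def granulometry(img, max_radius):
    """Foreground area remaining after opening with disks of radius 0..max_radius."""
    areas = []
    for radius in range(max_radius + 1):
        se = disk_offsets(radius)
        opened = dilate(erode(img, se), se)  # opening = erosion, then dilation
        areas.append(sum(map(sum, opened)))
    return areas


# Hypothetical 7x7 binary image: one 3x3 block and one isolated pixel.
toy = [[0] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        toy[y][x] = 1
toy[5][5] = 1
print(granulometry(toy, max_radius=2))
```

The isolated pixel disappears at radius 1 and the block at radius 2, so the drop in area at each radius indicates how much of the image consists of objects of that size.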
The pattern spectrum can be derived from the granulometry distribution. It gives an estimate of
CHAPTER 3
Genetic Algorithms
Evolutionary computation was developed with the idea that it could be used as a tool for
optimization and solutions to problems could be evolved using operators of natural selection. Early
randomly changing the state diagrams. John Holland invented Genetic Algorithms (GA), which
are population-based algorithms. The goal was to study the process of evolution and design a
A GA is a heuristic search algorithm based on the concepts of natural selection and genetics. The
idea is to mimic biological processes such as survival of the fittest, to evolve a solution for a
selection along with operations such as crossover and mutation [14]. Each chromosome consists
of genes. Selection operators choose individuals from the population that are the fittest, while
crossover and mutation mimic biological processes responsible for introducing diversity to the
population. While selection is an exploitation process, crossover and mutation are exploration
processes.
Evolutionary algorithms are most suitable for problems that involve a large search space, i.e., many
possible solutions. Other problems require that new solutions be produced at each stage to explore
new options, or they involve complex solutions that cannot be processed by hand [14]. GAs, like the
organism/solution is determined based on the problem at hand, and it is a factor which continuously
• a population of individuals/chromosomes – each chromosome is a possible solution to the
problem at hand. The population is modified or replaced over n iterations of the algorithm.
• fitness function – each chromosome is assigned a fitness value/score which indicates how
close the solution represented by the chromosome is to the expected result.
• selection criteria – the fitter the chromosome, the higher the chance it has of being selected.
• mutation operator – to create a new chromosome, random bits in the chromosome are
flipped.
Encoding – binary string (1 represents that the feature has been selected; 0 represents that the
Size of population – 50
Figure 15 Simple GA
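A simple GA with binary-string encoding can be sketched in Python as follows. Only the encoding and the population size of 50 come from the text above; the tournament selection, single-point crossover, bit-flip mutation, elitism, rates, number of generations, and the toy fitness function are all assumptions chosen to keep the example short, and they are not the parameters of the project.

```python
import random


def run_ga(fitness, n_features, pop_size=50, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve a bit string (1 = feature selected) that maximizes `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best chromosome over
        while len(children) < pop_size:
            # Tournament selection: fittest of 3 random individuals (assumed).
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # single-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # Bit-flip mutation: each gene flips with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


# Hypothetical fitness: reward three "useful" features, penalize extra ones.
RELEVANT = {0, 3, 7}


def toy_fitness(chromosome):
    hits = sum(chromosome[i] for i in RELEVANT)
    extras = sum(chromosome) - hits
    return hits - 0.1 * extras


best = run_ga(toy_fitness, n_features=12)
print(best, toy_fitness(best))
```

In a feature selection setting, the toy fitness would be replaced by a score computed from the selected features, such as a classification accuracy.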
Some examples of applications that use GAs, are: optimization tasks, machine learning, economic
In the next chapter, we will introduce various feature selection techniques and investigate the use
CHAPTER 4
Feature Selection
In real-world applications, data is collected to the granular level. This has been carried out over
many years in the belief that more data means more useful information for processing. With the
increase in number of devices worldwide, there has been a surge in the availability of data in a
way that storing, handling, and processing data has become difficult. Additionally, the data
collected is most often not pre-processed, and hence contains redundant and irrelevant data.
Dimensionality reduction techniques have been adopted to reduce the vast dimensions of data to
smaller dimensions [16]. The most popular dimensionality reduction techniques combine features
to reduce the dimension. Feature selection is one such dimensionality reduction technique, which selects
i. Filter method: - This method involves ranking features using suitable criteria such
that the highly-ranked features are picked for application [1]. The idea is to filter out lower ranked
features. The most important factor in this method is determining the rank or relevance of a feature.
\[ R(i) = \frac{\operatorname{cov}(x_i, y)}{\sqrt{\operatorname{var}(x_i) \cdot \operatorname{var}(y)}} \]
b. Mutual information: - This measure is used to measure the dependency between features.
The advantages of this method are that it is simple to compute and that it doesn’t rely on learning
algorithms. The drawback is that the features selected may not be guaranteed to be non-redundant
[1].
ii. Wrapper method: - This method depends on use of classification to determine a feature
subset. Exhaustive search methods may be able to arrive at the most optimal result but they
can be computationally intensive for large datasets. Therefore, 2 types of wrapper methods
starts with an empty set and adds features as and when they qualify. Sequential
Backward Search algorithms start with the entire feature set and progressively
Although this proves to be a convenient method for selecting features, the main
drawback is that the entire model must be built and evaluated for each feature subset
considered.
iii. Embedded methods: - This method tries to compensate for the drawbacks of filter and
wrapper methods. It involves algorithms that have in-built feature selection methods. This
combines the step of selecting features and determining performance into one step [1].
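As an illustration of the filter method, the correlation criterion R(i) can be used to rank features and keep the top-ranked ones. The following Python sketch is a minimal example; the feature names, the toy data, the use of the absolute value of the correlation as the ranking score, and the fallback of 0 for constant features are assumptions.

```python
import math


def pearson(x, y):
    """Correlation R(i) between one feature column x and the target y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation undefined, assumed 0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target, top_k):
    """Return the names of the top_k features by absolute correlation."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical feature columns and binary class labels.
columns = {
    "f1": [1, 2, 3, 4],
    "f2": [5, 5, 5, 5],
    "f3": [4, 1, 3, 2],
}
target = [0, 0, 1, 1]
print(rank_features(columns, target, top_k=1))
```

Because each feature is scored on its own, two highly ranked features may still be redundant with each other, which is the drawback of the filter method noted above.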
In this project, given that the dataset contains images, the number of features under consideration
is so large that the overhead of building and testing the model for each iteration becomes
unimportant as compared to selecting a smaller subset of features than originally available.
For feature selection using GA, the most natural and widely used chromosome encoding is the
binary string encoding [17, 18, 19, 20, 21, 22, 23]. In this, the chromosome is represented as a bit
string in which 1 indicates that the feature is selected and 0 that it is not. Some specific
implementations used representations which included weighted feature vectors and specific
classification model parameters, along with binary string encoding [24, 25]. While the most
common crossover and mutation operators are 2-point and random mutation respectively [21, 24],
some implementations make use of adaptive crossover and mutation [20, 26], where the probability
of crossover and mutation are learnt from iterations. While some implementations used variations
of Elitist selection [19, 20, 24, 27] or tournament [18], Roulette wheel seemed to be the most
popular selection method [17, 22, 23]. To improve the results, local improvements were used in
some cases where low performing features were replaced by high performing features [22]. Fitness
functions are application dependent. TABLE V summarizes previous research work done for
The next chapter explains the concept of classification and three different classification models.
TABLE V Summary of GA Parameters used for Feature Selection
Encoding: Weighted feature vector + binary encoding + application-specific parameter [24, 25, 28]
Fitness function: Classification accuracy [24, 25]
Selection: Elitist [19, 20, 24, 27]
Crossover: 2-point [24]
Mutation: Change 4 bits [24]
Data type: Image
CHAPTER 5
Classification
An Artificial Intelligence (AI) agent is designed and programmed to make decisions on certain
tasks based on its learning from data. An agent’s learning is essential for the following reasons
[29]:
iii. programmers may not know how to program the agent for a specific situation
Learning can be classified into 2 types based on the order of learning [29]:
Learning can also be classified into 2 other types based on the types of feedback [29]:
ii. Reinforcement learning – the agent learns from positive or negative feedback from the
previous learning.
iii. Supervised learning – the agent learns from input-output pairs. Example: Classification.
iv. Semi-supervised learning – the agent learns about new unlabeled examples based on data
5.1 Supervised learning
Given a training set of N input-output pairs (x_i, y_i), the task is to learn by searching for a possible
hypothesis (h) that will perform well even on new input-output pairs (Figure 16). The performance
of a hypothesis is measured in terms of accuracy in correctly predicting y_j for x_j, where (x_j, y_j)
Classification is a supervised learning problem in which y takes values from a finite set. If y can take
only 2 values, then the classification is called binary classification. Regression is a supervised
5.2 Cross-validation
Classifiers need to perform well on previously seen as well as new data. To verify this, validation
is performed on classification models. Validation refers to the process of testing the model using
combinations of training and test data and consolidating the results [30]. Generating different
combinations for validation is a challenging task. Cross-validation is one approach that generates
these combinations by making use of partially seen and unseen data (Figure 17 [30]).
(i) n-fold cross-validation: Here, the data set is split into n equal parts such that the
percentage of samples of each class is maintained in each fold. The most commonly
used value for n, is 10. In this case, the dataset is split into 10 equal parts. In the first
iteration, the 10th fold is used for testing and the others for training. In the second
iteration, the 9th fold is used for testing and the others for training. The process is
repeated for other folds. Each iteration of validation produces a classification accuracy,
Figure 17 Cross-validation
(ii) Leave-one-out cross validation: Suppose there are n entries in the data set, this method
considers n-1 entries for training and the last 1 entry as testing data. The validation
process is repeated n times by leaving one sample out each time for use as test data.
The accuracy is calculated for each iteration and the average accuracy of all iterations
(iii) Random sampling: In this method, first, k integers p_i (each less than n) are randomly
generated. Then, the original data set is shuffled k times to generate k different datasets
S_i {i = 1, ..., k}. Each S_i is partitioned into training and validation sets, such that there are p_i
samples in the training set and n − p_i samples in the validation set.
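The n-fold scheme described in (i) can be sketched in Python as follows. This minimal example splits sample indices into folds that preserve the class proportions and averages a per-fold score; the function names, the absence of shuffling, the averaging of scores, and the toy labels are assumptions for illustration.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=10):
    """Split sample indices into n_folds folds, keeping class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the samples of each class to the folds in round-robin order.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def cross_validate(labels, evaluate, n_folds=10):
    """Average the score returned by evaluate(train_idx, test_idx) over folds."""
    folds = stratified_folds(labels, n_folds)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


# Hypothetical labels: 20 samples of one class and 30 of another.
labels = ["art"] * 20 + ["photo"] * 30
folds = stratified_folds(labels, n_folds=10)
print([len(fold) for fold in folds])
```

Here `evaluate` stands for any function that trains a classifier on the training indices and returns its accuracy on the test indices.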
Given a dataset with 2 or more categories or classes, classification models are mathematical
models that can predict the category of new data based on information from existing data. For
The task of the classification models is to separate these 2 classes with a clear boundary
differentiating green from orange. Depending on the type of the classification model – linear,
hierarchical, or non-linear, the split between the classes may look like Figure 19.
Figure 19 (a) Linear model (b) Hierarchical model (c) Non-linear model
This mathematical model makes use of all the data points in the domain and therefore, it is required
that all data are available beforehand. The idea of this model is to place a line (hyperplane)
\[ y = w x^{T} + \gamma \]
in the domain and adjust it in such a way that the classification accuracy is maximized. When there
are multiple classes in the dataset, multiple lines are placed in the domain and adjusted to identify
multiple classes. SVMs may include a kernel when the data is not linearly separable. SVMs also
have the ability to give importance to certain features or samples, thus improving performance.
(i) Linear: The data is expected to be separated by a gap, such that a linear hyperplane can
separate the classes. The goal is to maximize the distance between the hyperplane and the
nearest data point, which is called the margin. Figure 20 (a) is an example of the partition
in linear SVM.
Figure 20 (a) Linear SVM (b) Non-linear SVM
(ii) Non-linear: This type of SVM is used when the data cannot be separated linearly.
Non-linear kernels such as the homogeneous polynomial kernel, the inhomogeneous polynomial
kernel, and the Radial Basis Function kernel are used. The idea is to find linear separations in higher-dimensional
While SVMs are memory efficient and useful when the data has a large dimension, they are
computationally slow.
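To make the margin and kernel ideas concrete, the following sketch evaluates the hyperplane for a fixed weight vector, measures the distance of each point to it, and computes a Radial Basis Function kernel value. It does not train an SVM; the weights, the offset `gamma`, the sample points, and the `sigma` width parameter are hypothetical values chosen only for illustration.

```python
import math


def decision_value(w, x, gamma):
    """Evaluate y = w . x + gamma for one point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + gamma


def distance_to_hyperplane(w, x, gamma):
    """Perpendicular distance from x to the hyperplane w . x + gamma = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, x, gamma)) / norm


def margin(w, gamma, points):
    """Distance from the hyperplane to the nearest data point."""
    return min(distance_to_hyperplane(w, p, gamma) for p in points)


def rbf_kernel(a, b, sigma=1.0):
    """Radial Basis Function kernel: exp(-||a - b||^2 / (2 sigma^2))."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical hyperplane and points, for illustration only.
w, gamma = [3.0, 4.0], -5.0
points = [[3.0, 4.0], [1.0, 2.0]]
closest = margin(w, gamma, points)
```

A trained linear SVM would choose `w` and `gamma` so that this margin is as large as possible while still separating the classes.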
Random Forests (RFs) are forests of decision trees generated using random sampling, which can be
used for both classification and regression problems [30]. While a decision tree classifier consists
of only one tree, an RF consults multiple decision trees in the testing phase, which makes it a better
option than a single decision tree. Since RFs are a group of different decision trees (Figure 21), they
are also called an ensemble method. This method groups together multiple classifiers to form a
strong classifier and improves performance using divide-and-conquer. N subsets of the data are
sampled and a decision tree is created for each subset. When test data is fed into the RF, it
is tested on all trees and the result is either an average of all values (in the case of regression) or
the majority value (in the case of classification). While RFs are fast in execution, they may
overfit the data.
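The sampling and aggregation steps of an RF can be sketched as follows. The sketch assumes that each tree is trained on a bootstrap sample (drawn with replacement), which is a common choice but is not stated in the text above, and it leaves out tree construction entirely; the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (assumed sampling scheme)."""
    return [rng.choice(data) for _ in data]


def majority_vote(tree_predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    """Regression: return the mean of the values predicted by the trees."""
    return sum(tree_predictions) / len(tree_predictions)
```

For example, if three trees label an image as 'photo', 'cartoon', and 'photo', `majority_vote` returns 'photo'.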
Neural networks are models that are designed to mimic the working of the human brain. Multiple
neurons work together to learn new information. Information is stored in the form of weights [31].
Ax = B
where A is the input, B is the outcome, and x are the weights in the network.
The most basic type of neural network is the perceptron. In its simplest form, the network is
represented by the equation $y_i = x_i w_i$. The output of this equation is then converted to values 0 or
1 using an activation function, thus performing a binary classification. Learning is done by
adjusting the weights until all the data points in the input dataset are correctly classified. While the
simple single-layer perceptron was effective on linearly separable datasets, the performance was
The multi-layer feed-forward neural network or multi-layer perceptron (MLP) is the most popular
neural network (Figure 22 [31]). It has input values xi, one or more hidden layers, and an output
layer [31]. While the general architecture of the MLP is like that of the simple perceptron, MLPs
have different activation functions that suit the application at hand. The weights in this type of
network are trained using backpropagation. The higher the weight, the tighter the correlation between
In this project, all three classifiers have been used to compare their performance on the given
dataset.
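The perceptron learning procedure described earlier in this chapter, in which the weights are adjusted until every training point is classified correctly, can be sketched as below. This is a minimal illustration and not the network used in the project: the step activation, the bias term `b`, the learning rate, and the epoch limit are assumptions made for the example.

```python
def step(z):
    """Step activation: map the weighted sum to 0 or 1."""
    return 1 if z >= 0 else 0


def predict(w, b, x):
    """Binary prediction for one input vector x."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_perceptron(samples, labels, lr=1.0, max_epochs=100):
    """Adjust weights until all samples are classified or epochs run out."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(samples, labels):
            output = predict(w, b, x)
            if output != target:
                errors += 1
                # Move the weights toward (or away from) the misclassified input.
                for i in range(len(w)):
                    w[i] += lr * (target - output) * x[i]
                b += lr * (target - output)
        if errors == 0:   # every training point is classified correctly
            break
    return w, b


# Logical AND is linearly separable, so the loop terminates early.
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]
w, b = train_perceptron(samples, labels)
```

On data that is not linearly separable, the loop would simply stop at `max_epochs` without reaching zero errors, which reflects the limitation of the single-layer perceptron noted above.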
CHAPTER 6
6.1 Applications
While there are numerous applications of image classification, ranging from a simple
differentiation between a cat and a dog, to a more complex application of image spam analysis,
the solution developed in this project focuses on classifying images as photographs or cartoons.
Previous work on classifying images of different domains has primarily focused on classifying
images as computer-generated graphics or camera-captured photographs [32, 33, 34] in the context
of digital forensics and watermarking. An attempt to use GAs to select features for the
increased accuracy while reducing the number of features from 234 to 100 [35]. Citing the
classification, identified 148 features that could successfully classify cartoons and photographs
[4]. Most image genre classification research makes use of either SVMs or neural networks for
The solution developed in this project focuses on classifying photographs and cartoons to retrieve
images based on a theme/genre. This is particularly useful in the context of a text to picture
conversion system wherein images are retrieved based on information from the text and displayed
in a manner that increases the user’s comprehension [36]. By including the advantages of the
proposed classifier into such a system, illustrations of a theme can be generated, targeting a specific
audience. For example, if the text to picture conversion system is used to convert medical
instructions meant for kids, to illustrations, then the proposed classifier could select only those
images that are cartoons. Figure 23 illustrates how the proposed classification solution could be
With the advances made in camera technology, photorealism has become both a boon and a bane.
While making quality photography accessible to the common man and not limiting advanced
various photo-editing software, have led to the increase in image tampering [32]. Images are
tampered with for various reasons ranging from forgery for monetary benefit to reuse without
permission. Without an automated system in place, it requires either the author or an expert to
manually identify if an image is fake or not [37]. For example, to check if a painting has been
tampered with, experts check the colors of the painting presented, to verify if the colors were
available around the time when the painting was made. As a first pass or filter, image classification
can be used to find out the genuineness of an image by classifying paintings and photographs [37]
[38], and computer generated images and photographs [32, 33, 34]. In this project, we make use
Although this project focuses on classifying images into cartoons, photos, and art paintings, the
solution can also be trained to classify other domains such as graphics and sketches. This is
possible because the features extracted from the images cover the majority of the feature types of
an image.
6.2 Dataset
The dataset, PACS (Photo, Art Painting, Cartoon, Sketch) [2], used for the experiments in this
project, consists of 9991 images in the 4 domains of photographs, paintings, cartoons, and sketches
(Figure 24 [2]). Images belong to various categories such as ‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’,
‘horse’, ‘house’, ‘person’. Given that the focus of this project is to classify photographs from
cartoons or paintings, only the domains photographs, cartoons and art paintings, consisting of 6062
Originally, PACS dataset was used to perform domain generalization [2], in which, images from
photograph, painting, and cartoon domains are used for training a model that can then recognize a
Figure 24 PACS dataset used for domain generalization
All experiments were run on a PC with 2.7 GHz Intel Core i5 running MacOS v10.12.6. Source
code was written using Python v2.7.10. The following Python and R (v3.4.3) libraries were used
in the implementation:
6.4 Results
Binary classification is performed – one for the classification of ‘photo’ vs ‘cartoon’ and another
10-fold cross validation was performed multiple times with different splits, different classification
models, and with both the reduced and the full feature sets. For each combination of data, the
majority over all the iterations is considered as the resultant label. Receiver Operating
Characteristic (ROC) curves are then generated for the reduced and full feature sets with different
classification models. The ROC curves in Figure 25-Figure 30 are representative of one iteration
Training set size: 3346 images – 1503 photos, 1843 art paintings
Validation set size: 372 images – 167 photos, 205 art paintings
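As a rough illustration of how an ROC curve can be derived from classifier outputs, the sketch below sweeps a decision threshold over scored validation samples and accumulates true-positive and false-positive rates. The function names and the example scores are hypothetical, ties between scores are not treated specially, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) points.

    labels are 1 (positive) or 0 (negative); scores are classifier
    confidences for the positive class. Samples are visited from the
    highest score to the lowest, which is equivalent to lowering the
    decision threshold one sample at a time.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for four validation images (1 = photo, 0 = art painting).
example = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

A curve that rises steeply toward the top-left corner, and hence an area close to 1, indicates that the classifier ranks most positive samples above the negative ones.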
Random Forest:
Figure 25 shows the performance of RF for classifying art from photos. The accuracy of
classification with Random Forest is ~86% irrespective of using reduced or full feature set.
Figure 25 Art vs Photo: 10-fold ROC curve for RF with Reduced and Full Feature set
SVM:
Figure 26 shows the performance of SVM for classifying art from photos. The accuracy of
classification with SVM is ~75% with the reduced feature set of 492 features and ~58% with the
Figure 26 Art vs Photo: 10-fold ROC curve for SVM with Reduced and Full Feature set
NN:
Figure 27 shows the performance of NN for classifying art from photos. The accuracy of
classification with NN is ~74% with the reduced feature set of 485 features and ~58% with the
Figure 27 Art vs Photo: 10-fold ROC curve for NN with Reduced and Full Feature set
Random Forest:
Figure 28 shows the performance of RF for classifying cartoons from photos. The accuracy of
classification with Random Forest is 95% irrespective of using reduced or full feature set.
Figure 28 Cartoon vs Photo: 10-fold ROC curve for RF with Reduced and Full Feature set
SVM:
Figure 29 shows the performance of SVM for classifying cartoons from photos. The accuracy of
classification with SVM is ~90% with the reduced feature set of 485 features and ~81% with the
Figure 29 Cartoon vs Photo: 10-fold ROC curve for SVM with Reduced and Full Feature set
NN:
Figure 30 shows the performance of NN for classifying cartoons from photos. The accuracy of
classification with NN is ~90% with the reduced feature set of 485 features and ~73% with the full
feature set.
Figure 30 Cartoon vs Photo: 10-fold ROC curve for NN with Reduced and Full Feature set
Figure 31 (a) summarizes the min, mean, and max accuracy of multiple iterations of classifying
art vs photo with various classification models with reduced and full feature set, while Figure 31
Figure 31 (a) Accuracy comparison: Art vs Photo (b) Accuracy comparison: Cartoon vs Photo
To confirm that the improvement in accuracy with the reduced feature set is statistically
significant, we make use of the T-test. Statistical significance indicates how unlikely it is that an
observed difference in accuracy arose from random variation alone. T-tests are used to compare the
means of two groups and indicate whether they are different from each other. A T-value close to 0
implies that the groups are similar. The P-value is the probability of observing a difference at least
as large as the one measured if the two groups actually had the same mean. The lower the P-value,
the stronger the evidence of a real difference. A threshold of 0.05 on the P-value corresponds to a
5% significance level, or equivalently a 95% confidence level. TABLE VI summarizes the
results of the T-test comparing the accuracy obtained with different classification models with the
reduced vs full feature set. According to the T-test, there is no statistically significant difference
between the reduced and full feature set in the case of RF. This can be concluded from the fact that
P-value > 0.05 and the 95% confidence interval for the difference contains the value 0. This also
supports the earlier observation that the accuracies were the same for RF. On the other hand, the
difference in accuracies in the case of SVM and NN is supported by the statistically significant
difference in the T-test for the same.
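The T-value used in this comparison can be computed directly from two lists of per-run accuracies. The sketch below uses Welch's unequal-variance form of the test, which is an assumption, since the exact variant used in the project is not stated here. The accuracy values are made-up placeholders, not results from the experiments. The P-value is omitted because it requires the cumulative distribution of the t-distribution, which the Python standard library does not provide.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t_value, degrees_of_freedom) for Welch's two-sample T-test."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Sample variance of each group divided by its size.
    sa = statistics.variance(group_a) / len(group_a)
    sb = statistics.variance(group_b) / len(group_b)
    t_value = (mean_a - mean_b) / math.sqrt(sa + sb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (sa + sb) ** 2 / (
        sa ** 2 / (len(group_a) - 1) + sb ** 2 / (len(group_b) - 1)
    )
    return t_value, dof


# Placeholder accuracies for illustration only.
reduced = [0.75, 0.74, 0.76, 0.75, 0.77]
full = [0.58, 0.60, 0.57, 0.59, 0.58]
t_value, dof = welch_t(reduced, full)
```

A T-value far from 0 relative to the degrees of freedom corresponds to a small P-value, which is the situation reported for SVM and NN; a T-value near 0 corresponds to the RF case.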
TABLE VI Comparing accuracies using the T-test
Art-Photo Cartoon-Photo
T-Value P-Value 95% confidence interval T-Value P-Value 95% confidence interval
CHAPTER 7
In this project, we classify cartoons from photos for genre-based image retrieval and classify art
paintings from photos to detect image tampering. Since cartoons are better identified using color
and shape properties, 878 color and shape features were extracted from each of the 6062 images
in the training data set. Along the same lines, since the texture of surfaces is more pronounced in art
paintings, 106 texture features were extracted. In total, 984 image features were extracted to
To demonstrate that feature selection not only improves execution time, but also improves
classification accuracy, we utilized a GA to select feature subsets from the entire set of features,
considering the classifier accuracy as the fitness function. Utilizing this reduced feature set for
image classification, the results showed that feature selection improves the accuracy of
classification in the case of SVM and Neural Networks, while not making an impact on the
classification accuracy in the case of Random Forests. This was expected, as the performance of
In terms of the appropriateness of the features extracted from the images, the texture-based features
helped correctly classify art paintings from the other categories. While color and shape features
helped classify cartoons, the classifiers often misclassified cartoon headshots as photos and photos
From our results, we can conclude that GA has a positive impact on the performance of
classification, and that SVM, along with feature selection, performs the best for the classification
considered.
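The GA-based selection summarized in this chapter can be sketched as a search over bit masks, where bit i indicates whether feature i is kept. The sketch below is illustrative only and is not the project's implementation: the operators (tournament selection, single-point crossover, bit-flip mutation, elitism), all parameter values, and the `toy_fitness` function are assumptions. In the project the fitness is the classifier accuracy obtained with the selected features, whereas `toy_fitness` is a hypothetical stand-in that rewards a fixed set of "informative" features.

```python
import random


def run_ga(n_features, fitness, pop_size=20, generations=30,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Return (best_mask, best_fitness_per_generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        next_gen = [best[:]]                      # elitism: keep the best mask
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            if rng.random() < crossover_rate:     # single-point crossover
                cut = rng.randint(1, n_features - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


INFORMATIVE = {0, 3, 5}   # hypothetical "useful" feature indices


def toy_fitness(mask):
    """Stand-in for classifier accuracy: reward useful features, penalize the rest."""
    hits = sum(mask[i] for i in INFORMATIVE)
    extras = sum(mask) - hits
    return hits - 0.1 * extras


best_mask, history = run_ga(10, toy_fitness, seed=1)
```

Replacing `toy_fitness` with a function that trains a classifier on the masked feature columns and returns its validation accuracy would turn this into a wrapper-style selector, at the cost of one classifier training run per fitness evaluation.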
As an extension of this work, we can reuse the solution proposed in this project in other scenarios,
sketch version of an image from other genres/domains. We can also focus on improving the set of
features extracted from images to cater to a wide variety of classification applications. A study
comparing GA-based feature selection with other wrapper methods is also planned.
REFERENCES
[1] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Computers and
[2] D. Li, Y. Yang, Y.-Z. Song and T. Hospedales, "Deeper, Broader and Artier Domain
[4] T. I. Ianeva, A. P. de Vries and H. Rohrig, "Detecting cartoons: a case study in automatic
2003.
[5] K. Dittakan, F. Coenen, R. Christley and M. Wardeh, "A Comparative Study of Three
Image Representations for Population Estimation Mining Using Remote Sensing Imagery,"
2013.
"Combining Local and Global Image Features for Object Class Recognition," in IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, San Diego,
CA, 2005.
[7] S. Kodituwakku and S. S, "Comparison of Color Features for Image Retrieval," Indian
[8] Wikipedia contributors, "Color moments - Wikipedia," 4 February 2012. [Online].
[10] B. Pathak and D. Barooah, "Texture Analysis Based on the Gray-level Co-occurrence
[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, San Diego,
CA, 2005.
[12] S. Paris, P. Kornprobst, J. Tumblin and F. Durand, "A gentle introduction to bilateral
[14] M. Mitchell, An Introduction to Genetic Algorithms, Cambridge, MA: MIT Press, 1998.
Massive Datasets, New York, Cambridge University Press, 2014, pp. 415-447.
[17] F. Gómez and A. Quesada, "Genetic algorithms for feature selection in Data Analytics," 01
https://www.neuraldesigner.com/blog/genetic_algorithms_for_feature_selection. [Accessed
05 09 2017].
[18] E. Sivasankar and R. S. Rajesh, "Design and development of efficient feature Selection and
2012.
[19] P. Kushwaha and R. Welekar, "Feature Selection for Image Retrieval based on Genetic
[20] H. Huang, Y. Wu, Y. Chan and C. Lin, "Study on image feature selection: A genetic
[21] C. H. Lin, H. Y. Chen and Y. S. Wu, "Study of image retrieval and classification based on
adaptive features using genetic algorithm feature selection," Expert Systems with
[22] I. S. Oh, J. S. Lee and B. R. Moon, "Hybrid genetic algorithms for feature selection," IEEE
normal breast tissue," Medical Physics, vol. 23, no. 10, pp. 1671-1684, 1996.
[24] J. Lu, T. Zhao and Y. Zhang, "Feature selection based-on genetic algorithm for image
2000.
[26] L. Liang, J. Peng and B. Yang, "Image Feature Selection Based on Genetic Algorithm," in
[27] F. Catak, "Genetic Algorithm based Feature Selection in High Dimensional Text Dataset
Classification," WSEAS Transactions on Information Sciences and Application, vol. 12, no.
Barman, "Genetic Algorithm Based Feature Selection Combined with Dual Classification
[29] P. Norvig and S. Russell, Artificial Intelligence: A Modern Approach, 3rd ed.,
[30] S. Suthaharan, Machine Learning Models and Algorithms for Big Data Classification,
[31] J. Patterson and A. Gibson, Deep Learning A Practitioner’s Approach, O’Reilly Media,
Inc., 2017.
[32] T.-T. Ng and S.-F. Chang, "An online system for classifying computer graphics images
Contents, 2006.
[34] Z. Li, J. Ye and Y. Q. Shi, "Distinguishing computer graphics from photographic images
using local binary patterns," in 11th international conference on Digital Forensics and
Watermaking, 2012.
[35] W. Chen, Y. Q. Shi, G. Xuan and W. Su, "Computer graphics identification using genetic
https://github.com/vandanavk/text2collage.
[37] R. Nemade, A. Nitsure, P. Hirve and S. B. Mane, "Detection of Forgery in Art Paintings
network.org/challenges/VOC/voc2007/workshop/index.html.
https://www.kaggle.com/dollarakshay/pokemon-images/data.
[41] A. Mishra, S. Nandan Rai, A. Mishra and C. Jawahar, "IIIT-CFW: A Benchmark Database
of Cartoon Faces in the Wild," in 1st Workshop on Visual Analysis and Sketch (ECCVW),
2016.
[42] BagoGames, "New Looney Tunes Movies By X-Men Writers," 28 August 2014. [Online].
Available: https://www.flickr.com/photos/bagogames/15036256406.
[43] C. Bircanoğlu, "Comic Books Images," 2017. [Online]. Available:
https://www.kaggle.com/cenkbircanoglu/comic-books-classification/data.
https://www.kaggle.com/c/painter-by-numbers/data.
APPENDIX
Five images each (Figure 32-Figure 46) from the categories photographs, cartoons, and art paintings
were retrieved from different sources for testing, and the performance of the classification models
was evaluated with the reduced feature set selected by GA. On average, the GA selects about
485 features out of the 984 features that are extracted for each image. TABLE VII summarizes this
performance.
TABLE VII Performance of the classifiers with the reduced feature set selected by GA
Image True Label Classification by RF Classification by SVM Classification by NN