International Journal of Pure and Applied Mathematics Special Issue

A Review of Multi-Class Classification Algorithms

Abstract: Classification is one of the crucial tasks of data mining, and many machine learning algorithms are inherently designed for binary decision problems. Classification is a complex process that may be affected by many factors. This paper examines current practices, problems, and prospects of multi-class classification. In several application domains, such as biology, computer vision, social network analysis and information retrieval, multi-class classification problems arise in which data instances do not simply belong to one particular class, but exhibit partial membership to several classes. The emphasis is placed on summarizing the major advanced classification approaches and the techniques used for improving classification accuracy in multi-class classification for different datasets.

Keywords: Data Mining, Multi-class Classification

1. Introduction

Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. It is an essential process in which intelligent methods are applied to extract data patterns, and an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. Conventionally, the information that is mined is denoted as a model of the semantic structure of the datasets. The model might be utilized for prediction and categorization of new data. In recent years the sizes of databases have increased rapidly. This has led to a growing interest in the development of tools capable of automatically extracting knowledge from data. The term Data Mining, or Knowledge Discovery in Databases, has been adopted for a field of research dealing with the automatic discovery of implicit information or knowledge within databases [16]. Diverse fields such as marketing, customer relationship management, data engineering, medicine, crime analysis, expert prediction, web mining and mobile computing, among others, utilize data mining.

Data mining involves six common classes of tasks:

Anomaly detection (outlier/change/deviation detection) – the identification of unusual data records that might be interesting, or data errors that require further investigation.

Association rule learning (dependency modelling) – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Clustering – the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

Classification – the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Regression – attempts to find a function which models the data with the least error; that is, to estimate the relationships among data or datasets.

Summarization – providing a more compact representation of the data set, including visualization and report generation.

In this paper we concentrate on the classification of data. Classification is one of the fundamental and most important tasks in data mining and machine learning. Databases are rich with hidden information which can be used for intelligent decision making. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification is a data mining (machine learning) technique used to predict group membership for data instances. Machine learning refers to a system that has the capability to automatically
learn knowledge from experience and in other ways. Classification predicts categorical labels, whereas prediction models continuous-valued functions. Classification is the task of generalizing known structure to apply to new data, while clustering is the task of discovering groups and structures in the data that are in some way or another similar, without using known structures in the data.

In machine learning, the problem of classification is encountered in various areas: for example, in medicine, to identify the disease of a patient; or in industry, to decide whether a defect has appeared, or whether a temperature is low, middle or high.

Learning methods are divided into the following categories.

Supervised: All data is labeled and the algorithms learn to predict the output from the input data.

Unsupervised: All data is unlabeled and the algorithms learn the inherent structure from the input data.

Semi-supervised: Some data is labeled but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.

Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output:

Y = f(X)

Supervised learning problems can be further grouped into regression and classification problems.

Classification: A classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".

Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight".

Some popular examples of supervised machine learning algorithms are:

Linear regression for regression problems.
Random forest for classification and regression problems.
Support vector machines for classification problems.

Semi-supervised learning refers to the use of both labeled and unlabeled data for training. Many machine-learning
researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. Labelled instances, however, are often difficult, expensive, or time-consuming to obtain, as they require the efforts of experienced human annotators, while unlabeled data may be relatively easy to collect. Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with the labelled data, to build better classifiers; it requires less human effort and gives higher accuracy. Some popular examples of semi-supervised methods are EM with generative mixture models, self-training, co-training, transductive support vector machines and graph-based methods.

2. Methods for Multiclass Classification

There are three groups of methods for solving multiclass classification problems:

1. Extension from the binary case
2. Converting the multiclass classification problem into several binary classification problems
3. Hierarchical classification methods

Extension from binary

These are strategies that extend existing binary classifiers to solve multi-class classification problems. Several algorithms based on neural networks, decision trees, nearest neighbours, naive Bayes, support vector machines and extreme learning machines have been developed to address multi-class classification problems.

2.1 Hierarchical classification

Hierarchical classification methods differ in a number of criteria. The first criterion is the type of hierarchical structure used. This structure is based on the problem structure and is typically either a tree or a DAG.

We can do the same thing with binary classifiers: if we have a large number of classes, we can divide them into two sets of classes, say A and B. Then we can divide the classes in A into two smaller sets of classes, divide B into two smaller sets of classes, and so on. To run our multi-class classification, we would first train a binary classifier to determine whether a new data point is in some class in A or in some class in B. We then train a second binary classifier to determine which of the two subsets of A a point is in, and a third classifier to determine which of the subsets of B a point is in. We continue all the way down, until we get to classifiers that distinguish individual classes. This is called hierarchical classification because the different steps in the scheme form a sort of hierarchy, from the first question (the CEO) to the second-level questions (the vice-presidents) and down to the final questions (the mailroom clerks) that distinguish individual classes.

Transformation to binary

These methods reduce the problem of multiclass classification to multiple binary classification problems. They can be categorized into One vs Rest and One vs One.
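The One vs Rest decomposition just described can be sketched in a few lines of Python. The nearest-centroid scorer below is only an illustrative stand-in for a real binary learner such as an SVM; all function names here are our own, not from any library:

```python
# One-vs-Rest: train one binary scorer per class; at test time,
# the class whose scorer is most confident wins.

def train_ovr(samples, labels, train_binary):
    """Train one binary classifier per distinct class.

    train_binary(samples, binary_labels) -> scoring function,
    a stand-in for any real binary learner (e.g. an SVM)."""
    classifiers = {}
    for cls in set(labels):
        binary_labels = [1 if y == cls else -1 for y in labels]
        classifiers[cls] = train_binary(samples, binary_labels)
    return classifiers

def predict_ovr(classifiers, x):
    # Pick the class whose binary scorer reports the highest score.
    return max(classifiers, key=lambda cls: classifiers[cls](x))

# Toy binary learner: score by negated squared distance to the
# centroid of the positive class (illustrative only).
def centroid_scorer(samples, binary_labels):
    pos = [x for x, y in zip(samples, binary_labels) if y == 1]
    centroid = [sum(c) / len(pos) for c in zip(*pos)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))

samples = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0), (9, 1)]
labels  = ["a", "a", "b", "b", "c", "c"]
model = train_ovr(samples, labels, centroid_scorer)
print(predict_ovr(model, (5, 5.5)))  # prints b
```

A One vs One scheme differs only in that a scorer is trained for every pair of classes and the predictions are combined by voting.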
During testing, samples are classified by finding the margin from the linear separating hyperplane, and the final output is the class that corresponds to the SVM with the largest margin. However, if the outputs corresponding to two or more classes are very close to each other, those points are labeled as unclassified. This multiclass method has the advantage that the number of binary classifiers to construct equals the number of classes. However, there are some drawbacks. First, during the training phase the memory requirement is very high, amounting to the square of the total number of training samples; this may cause problems for large training data sets and may lead to computer memory problems. Second, suppose there are K classes and each has an equal number of training samples. During the training phase, the ratio of training samples of one class to the rest of the classes will be 1:(K − 1). This ratio, therefore, shows that the training sample sizes will be unbalanced.

2.2.2 One against One Approach (OAO)

In this method, SVM classifiers for all possible pairs of classes are created. Therefore, for K classes, there will be K(K − 1)/2 binary classifiers.

An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by biological nervous systems. It is composed of a large number of highly interconnected processing elements called neurons. An ANN is configured for a specific application, such as pattern recognition or data classification. ANNs have the ability to derive meaning from complicated or imprecise data, to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques, and they offer adaptive learning and real-time operation. Neural networks are commonly used for classification problems and regression problems.

A neuron in an artificial neural network consists of:

1. A set of input values (xi) and associated weights (wi).
2. A function (g) that sums the weights and maps the results to an output (y).

Neurons are organized into layers: input, hidden and output. The input layer is composed not of full neurons, but rather consists simply of the record's values that are
inputs to the next layer of neurons. The next layer is the hidden layer. Several hidden layers can exist in one neural network. The final layer is the output layer, where there is one node for each class. A single sweep forward through the network results in the assignment of a value to each output node, and the record is assigned to the class node with the highest value.

Advantages:
3. A neural network learns and does not need to be reprogrammed.
4. It can be implemented in any application without any problem.
5. High accuracy and noise tolerance.

Disadvantages:
1. The neural network needs training to operate.
2. It requires high processing time for large neural networks.
3. Lack of transparency.
4. Learning time is long.
5. Defining classification rules is difficult.
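The neuron model described above (inputs xi, weights wi, and a function g that maps the weighted sum to an output y), together with the single forward sweep that assigns the record to the highest-valued output node, can be sketched as follows. The weights are hand-chosen illustrative values, not learned as a real network's would be:

```python
import math

def neuron(inputs, weights, bias):
    # g: weighted sum of the inputs, squashed by a sigmoid -> output y.
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))

def forward(record, layers):
    """One forward sweep; each layer is a list of (weights, bias) neurons."""
    values = record
    for layer in layers:
        values = [neuron(values, w, b) for w, b in layer]
    return values

# Tiny 2-input network: one hidden layer, 3 output nodes (one per
# class); the record goes to the class node with the highest value.
hidden = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output = [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0), ([0.2, 0.2], -0.5)]

outputs = forward([0.9, 0.1], [hidden, output])
predicted_class = max(range(len(outputs)), key=outputs.__getitem__)
print(predicted_class)  # prints 0
```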
5. Accuracy is severely degraded by noisy or irrelevant features.

2.3.6 Naïve Bayes Classifier

The Naive Bayes classifier technique is based on the Bayesian theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. Bayesian classification is used as a probabilistic learning method (Naive Bayes text classification), and Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents.

The classifier predicts membership probabilities for each class, such as the probability that a given record or data point belongs to a particular class. The class with the highest probability is considered the most likely class. This is also known as Maximum A Posteriori (MAP) estimation.

The Naive Bayes classifier assumes that all the features are unrelated to each other: the presence or absence of a feature does not influence the presence or absence of any other feature.

Advantages:
1. Easy to implement.
2. Good results obtained in most of the cases.
3. It requires only a small amount of training data to estimate the parameters.
4. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Disadvantages:
1. The class-conditional independence assumption leads to a loss of accuracy.
2. In practice, dependencies among variables cannot be modelled by the naïve Bayesian classifier.
3. When the conditional independence assumption is violated, that is, when the features of real-world data are highly correlated, it performs very poorly.

Support Vector Machine (SVM)

An SVM works in a high-dimensional space and uses almost all attributes. It separates the space in a single pass to generate flat and linear partitions, dividing the two categories by a clear gap that should be as wide as possible; this partitioning is done by a plane called a hyperplane.

An SVM creates the hyperplanes that have the largest margin in a high-dimensional space to separate the given data into classes. The margin between the two classes represents the longest distance between the closest data points of those classes. The larger the margin, the lower the generalization error of the classifier. After training, new data are mapped into the same space and categorized into the partitions learned from the training data.

Advantages:
1. Of all the available classifiers, SVM provides the largest flexibility.
2. High accuracy, good theoretical guarantees regarding overfitting, and, with an appropriate kernel, SVMs can work well even if the data is not linearly separable in the base feature space.
3. SVMs are like probabilistic approaches but do not consider dependencies among attributes.

Disadvantages:
1. Picking/finding the right kernel can be a challenge.
2. Results/output are incomprehensible.
3. There is no standardized way of dealing with multi-class problems; the SVM is fundamentally a binary classifier.

4. Comparison

4.1 Different algorithms that can be extended from binary
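To make the comparison of these algorithms concrete, the MAP decision rule of the Naive Bayes classifier described above (multiply the class prior by the per-feature likelihoods, assumed independent, and pick the class with the highest posterior) can be sketched as follows. This assumes categorical features and a simple add-one smoothing variant, details the text does not specify:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Count class priors and per-feature value frequencies."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            counts[(y, i)][v] += 1
    return priors, counts

def predict_nb(priors, counts, x):
    total = sum(priors.values())
    best, best_logp = None, -math.inf
    for cls, n in priors.items():
        # log P(class) + sum_i log P(feature_i | class); features are
        # treated as independent, with a simple add-one smoothing.
        logp = math.log(n / total)
        for i, v in enumerate(x):
            c = counts[(cls, i)]
            logp += math.log((c[v] + 1) / (n + len(c) + 1))
        if logp > best_logp:
            best, best_logp = cls, logp
    return best  # the MAP class

samples = [("sunny", "hot"), ("sunny", "mild"),
           ("rainy", "mild"), ("rainy", "cool")]
labels  = ["no", "no", "yes", "yes"]
priors, counts = train_nb(samples, labels)
print(predict_nb(priors, counts, ("rainy", "mild")))  # prints yes
```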
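The SVM margin discussed above — the distance from the separating hyperplane to the closest data points — can also be computed directly for a fixed hyperplane. The hyperplane here is hand-chosen for illustration; an actual SVM would learn the one that maximizes this margin:

```python
import math

def margin(w, b, points):
    """Distance from the hyperplane w.x + b = 0 to the closest point."""
    norm = math.sqrt(sum(c * c for c in w))
    return min(abs(sum(wi - 0 + wi * 0 + wi * xi for wi, xi in zip(w, x)) + b)
               for x in points) / norm if False else \
           min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x in points) / norm

# Two linearly separable classes in the plane.
class_a = [(0.0, 0.0), (1.0, 0.5)]
class_b = [(3.0, 3.0), (4.0, 2.5)]

# Hand-chosen separating hyperplane x + y - 4 = 0, i.e. w = (1, 1), b = -4.
w, b = (1.0, 1.0), -4.0
print(margin(w, b, class_a + class_b))  # distance to the closest point
```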