Text Classification Using Machine Learning
Methods-A Survey
Basant Agarwal and Namita Mittal
Abstract Text classification is used to organize documents into a predefined set of
classes. It is very useful in Web content management, search engines, email filtering,
etc. Text classification is a difficult task due to the high-dimensional feature vector
comprising noisy and irrelevant features. Various feature reduction methods have been
proposed for eliminating irrelevant features as well as for reducing the dimension
of the feature vector. The relevant, reduced feature vector is then used by a machine
learning model for better classification results. This paper presents various text
classification approaches using machine learning techniques, as well as feature selection
techniques for reducing the high-dimensional feature vector.
Keywords Text classification · Feature selection · Machine learning algorithms
1 Introduction
Text mining aims to extract relevant information from text and to search for interesting
relationships between the extracted entities. Text classification is one of the basic
and important tasks of text mining. Text classification means automatically assigning a
document to one of some predefined categories based on its content. Text classification
is a supervised learning model that can classify text documents according to their
predefined categories. Web content for a search engine can be organized properly using
text classification for efficient retrieval of Web documents. Text classification
techniques are used for automatic email filtering, medical diagnosis,
newsgroup filtering, document organization, indexing for document retrieval, and word
sense disambiguation by detecting the topics a document covers.
The main challenges for text classification are the following:
1. High dimensionality: it is difficult to build a classifier model because the
   performance of the classifier degrades as the size of the feature vector
   increases [1].
2. Not all features are important for classification: some features may be redundant
   or irrelevant, and some may even mislead the classification result [1].
3. Redundant and noisy features must be removed from the data.
In text classification, the feature vector generally consists of thousands of
attributes/features, which is why feature reduction methods have to be used to remove
irrelevant features in such a way that classifier accuracy is not affected. The efficiency
and success of any machine learning algorithm depend on the quality of the data. Auto-
matic feature reduction methods are used to reduce the size of the feature vector
and to remove irrelevant features. There are two methods for this purpose: (i) feature
selection and (ii) feature transformation. In feature selection, important features
are identified and used for classification. In feature transformation, the feature vector
is transformed into a new feature vector of lower dimension.
The objective of this paper is to discuss (i) filter-based feature selection methods,
(ii) feature transformation techniques, and (iii) machine learning techniques used for
text classification.
The remainder of the paper is organized as follows. Section 2 describes the text
classification process, Sect. 2.4 discusses the classifier models used for text
classification, and Sect. 3 concludes the paper.
2 Text Classification Process
In the text classification process, documents are first read from the collection; then
preprocessing such as stemming and stop-word removal takes place. After that, important
features are selected from the feature vector, and the lower-dimensional feature vector
is fed to the classifier. Common text classification methods include both supervised and
unsupervised machine learning methods such as Support Vector Machine (SVM) [9],
K-Nearest Neighbour (KNN), Neural Network (NN), and Naive Bayes [19].
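As a minimal illustration of this overall process, the following sketch builds a TF-IDF
representation and trains an SVM classifier. It assumes the scikit-learn library (a choice
not prescribed by this survey), and the tiny corpus, labels, and test sentence are purely
hypothetical placeholders:

# Minimal text classification pipeline sketch (scikit-learn assumed; toy data is hypothetical).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

docs = [
    "stock markets fall as oil prices rise",           # finance
    "central bank raises interest rates again",        # finance
    "team wins championship after penalty shootout",   # sports
    "star striker injured before the final match",     # sports
]
labels = ["finance", "finance", "sports", "sports"]

# TfidfVectorizer performs tokenisation, stop-word removal, and TF-IDF weighting;
# LinearSVC is one of the classifiers surveyed in this paper.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LinearSVC()),
])
model.fit(docs, labels)

print(model.predict(["goalkeeper saves last-minute penalty"]))  # expected: ['sports']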
2.1 Preprocessing
The most common preprocessing tasks for text classification are stop-word removal and
stemming. In stop-word removal, common words in the documents which are not
discriminatory for the different classes are removed from the feature vector.
For example, “a”, “the”, “that”, etc., are frequent words that do not help in classifi-
cation, as they occur almost equally in all the documents.
In stemming, different forms of the same word are converted into a single word.
For example, singular, plural, and different tenses are converted into a single word.
The Porter stemmer is a well-known algorithm for stemming [7].
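A brief sketch of these two preprocessing steps follows, assuming the NLTK library for the
Porter stemmer; the small stop-word list here is illustrative, not exhaustive:

# Stop-word removal and Porter stemming sketch (NLTK assumed; stop list is illustrative).
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "that", "is", "are", "in", "of", "and", "to"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()                        # naive whitespace tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]             # stemming, e.g. "classifiers" -> "classifi"

print(preprocess("The classifiers are classifying the classified documents"))
# e.g. ['classifi', 'classifi', 'classifi', 'document']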
2.2 Text Representation
For text classification using machine learning methods, each document should be
represented in a form to which a learning algorithm can be applied. Thus, each document
is represented as a vector of words/terms/features. The values in the feature vector
are weighted to reflect the frequency of words in the document and the distribution
of words across the collection. The more often a word/term occurs in a document,
the more relevant it is to that document. The more often the word occurs throughout
all documents in the collection, the more poorly it discriminates between documents
[15]. A popular weighting scheme is Term Frequency–Inverse Document Frequency
(TF-IDF): w_ij = tf_ij * idf_i, where tf_ij is the frequency of term i in document j,
and idf_i is the inverse document frequency, which measures whether a term is common
or rare across documents. IDF can be calculated as idf_i = log(N/F_i), where N is the
total number of documents in the corpus and F_i is the number of documents in which
term i appears.
The tf*idf weighting scheme does not consider the length of the document; tfc
weighting is similar to tf*idf weighting except that length normalization is used. In
addition, a logarithm-based weighting scheme, log-weighted term frequency, uses the
logarithm of the word frequency, reducing the effect of very large term frequencies in
long documents [19]. Another method is simple word frequency weighting, i.e., using the
raw frequency of the term in the document [19]. A further method for text representation
is to use binary feature values, i.e., whether a term is present in the document or
not [14].
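The TF-IDF weighting described above can be computed directly from its definition. The
following sketch uses only the Python standard library; the toy corpus is a hypothetical
placeholder:

# Manual sketch of TF-IDF weighting: w_ij = tf_ij * idf_i, with idf_i = log(N / F_i).
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
N = len(docs)

# Document frequency F_i: number of documents containing term i.
df = Counter()
for doc in docs:
    df.update(set(doc))

def tfidf(doc):
    tf = Counter(doc)  # raw term frequency tf_ij
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))
# "the" appears in 2 of 3 documents, so its idf = log(3/2) is small;
# "cat" appears in only 1 document, so its idf = log(3) is larger.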
2.3 Feature Reduction
Feature reduction methods are used to remove irrelevant features and reduce the
dimensionality of the feature space. There are basically two approaches to feature
reduction: (1) feature selection and (2) feature extraction/feature transformation.
Feature extraction reduces the dimensionality by transforming/projecting all the features
into a smaller set of new features; it maps the high-dimensional data onto a lower
dimensional space. The new attributes are obtained as combinations of all the original
features, e.g., Principal Component Analysis (PCA) [22] and Singular Value Decomposition
(SVD) [12]. Feature selection techniques select important features/attributes from
the high-dimensional feature vector using certain criteria, e.g., Information Gain
(IG). Their main purpose is to reduce the dimensionality of the feature space and remove
the irrelevant features, so that the performance and accuracy of the machine learning
algorithm can be improved and the algorithm can run faster.
Feature selection methods are basically of three types, depending on how they
select features from the feature vector: the filter approach, the wrapper approach, and
the embedded approach [14, 22].
In the filter approach [14, 22], all features are treated independently of each other.
Features are ranked according to an importance score, which is calculated using some
scoring function. Filter-based methods do not depend on the classifier. The advantages of
this approach are that it is computationally simple, fast, and independent of the
classifier: the feature selection step is performed once, and the reduced feature vector
can then be used with any classifier. The disadvantage of this approach is that it does
not interact with the classifier and assumes features are independent; it is possible
that a feature performs well alone but worse in combination with other features, and
similarly a lower scoring attribute can show good performance in combination with other
features [22]. However, multivariate filter approaches modify the basic filter approach
to take feature dependencies into account.
In the wrapper approach [14, 22], a search procedure is defined over the space of feature
subsets, and various subsets of features are generated and evaluated for a specific
classifier. In the wrapper approach, features are treated as dependent on each other, and
the model interacts with the classifier. As the number of feature subsets grows
exponentially with the number of features, heuristic search methods are used for
selecting feature subsets. The advantages of this approach are that it interacts with
the classifier and that feature dependencies are considered. The disadvantages are that
there is a risk of overfitting and that the approach is slow and classifier-dependent.
The filter approach is very fast compared to the wrapper approach; the wrapper approach
is very effective but specific to a classifier algorithm and time consuming. If the
dataset is large, it is very difficult to build a wrapper.
2.3.1 Filter-based Feature Selection Methods
Document Frequency
Document Frequency (DF) is the number of documents in which a term appears.
In document frequency thresholding, terms whose document frequency is less than a
predefined value are removed. This is an unsupervised feature selection method, since it
can be computed without class labels. The assumption is that rare terms are less
informative for learning algorithms [3, 4], and that frequent words are more likely to
be present in future test cases.
Information Gain
Information gain measures the decrease in entropy when the value of a feature is known,
i.e., the number of bits of information obtained by knowing the presence or absence of a
term for prediction [4].
First, the information gain of each term is computed. Then, terms whose value is below a
predefined threshold are removed from the feature vector [4].
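The following minimal sketch illustrates the information gain of a single binary term
feature, IG(t) = H(C) − [P(t) H(C | t present) + P(¬t) H(C | t absent)]; the
presence/absence vector and class labels are hypothetical:

# Information gain of a binary term feature (toy data; standard library only).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(term_present, labels):
    with_t = [y for x, y in zip(term_present, labels) if x]
    without_t = [y for x, y in zip(term_present, labels) if not x]
    p = len(with_t) / len(labels)
    return entropy(labels) - (p * entropy(with_t) + (1 - p) * entropy(without_t))

labels = ["sports", "sports", "finance", "finance"]
term_present = [True, True, False, False]        # term occurs only in the sports documents
print(information_gain(term_present, labels))    # 1.0 bit: the term perfectly splits the classes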
Mutual Information
The mutual information of a term and the class attribute can be used for feature
selection. Mutual information quantitatively measures the relationship between any
two features, or between a feature and a class variable. It compares the probability of
term t and class c occurring together with the probabilities of the term and the class
individually [6, 22]. The mutual information between term t and class c is defined as
I(t, c) = log [ P(t, c) / (P(t)*P(c)) ] = log [ P(t ∧ c) / (P(t)*P(c)) ]    (1)
If there is a relationship between the term and the class, then the joint probability
P(t, c) will be greater than P(t)*P(c), and I(t, c) >> 0. A high value of mutual
information between a feature and the class indicates a higher importance of the feature
for classification. A threshold value can be set for selecting features.
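Equation (1) can be evaluated directly from document counts, as in the following sketch;
the counts used here are hypothetical:

# Pointwise mutual information from Eq. (1): I(t, c) = log( P(t, c) / (P(t) * P(c)) ).
import math

def mutual_information(n_t_and_c, n_t, n_c, n_docs):
    p_tc = n_t_and_c / n_docs   # fraction of documents containing t AND belonging to c
    p_t = n_t / n_docs          # fraction of documents containing t
    p_c = n_c / n_docs          # fraction of documents belonging to c
    return math.log(p_tc / (p_t * p_c))

# Hypothetical counts: term appears in 80 of 1000 documents, 70 of which belong to class c,
# and 100 documents in total belong to class c.
print(mutual_information(n_t_and_c=70, n_t=80, n_c=100, n_docs=1000))
# > 0, since P(t, c) = 0.07 is much larger than P(t) * P(c) = 0.008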
Chi Square
The chi-square statistic measures the lack of independence between term t and class c. It
can be used to test independence or association between two variables. The chi-square
test tries to identify the best terms for class c as the ones that are distributed most
differently between the sets of positive and negative examples of class c [1, 2].
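A hedged sketch of chi-square feature selection follows, assuming a recent version of
scikit-learn (an assumed library choice); the corpus and labels are hypothetical
placeholders:

# Chi-square feature selection sketch (scikit-learn assumed; toy data is hypothetical).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = [
    "goal scored in the last minute",
    "the striker missed an easy goal",
    "shares fell after the earnings report",
    "the market rallied on strong earnings",
]
labels = ["sports", "sports", "finance", "finance"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

selector = SelectKBest(chi2, k=4)            # keep the 4 terms with the highest chi-square score
X_reduced = selector.fit_transform(X, labels)

selected = selector.get_support(indices=True)
print([vec.get_feature_names_out()[i] for i in selected])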
Odds Ratio
The Odds Ratio is the odds of the word occurring in the positive class normalized by the
odds of it occurring in the negative class. It has been used for relevance ranking in
information retrieval. It is based on the assumption that the distribution of features in
the relevant documents is different from the distribution of features in the nonrelevant
documents [17].
2.3.2 Feature Transformation
Feature transformation techniques are used to reduce the feature vector size; they do
not rank the features according to their importance but instead transform the
high-dimensional feature space into a lower dimensional feature space.
Singular value decomposition can be used for feature reduction in text classification.
Latent Semantic Analysis (LSA) uses the singular value decomposition method for mapping
high-dimensional features to a lower dimensional space, the latent semantic
space [12].
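As a minimal sketch of this idea, the following projects TF-IDF vectors into a
low-dimensional latent space with truncated SVD; scikit-learn is assumed and the corpus
is a hypothetical placeholder:

# LSA sketch: TF-IDF vectors projected into a low-dimensional latent semantic space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the court ruled on the patent dispute",
    "the judge dismissed the patent claim",
    "the team celebrated the championship win",
    "fans celebrated after the final win",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # high-dimensional sparse matrix
lsa = TruncatedSVD(n_components=2)                             # 2 latent dimensions
X_lsa = lsa.fit_transform(X)

print(X.shape, "->", X_lsa.shape)   # e.g. (4, 13) -> (4, 2)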
Principal Component Analysis (PCA) is a common method for feature transformation. PCA
seeks a linear projection of high-dimensional data into a lower dimensional space in such
a way that maximum variance is extracted from the variables. The extracted variables are
called principal components; they are orthogonal to each other and uncorrelated.
Principal Component Analysis discards components with small variance [11].
Linear Discriminant Analysis (LDA) is another popular dimension reduction method. It
finds the features that have high class-discriminant capability. Discriminant features
are identified by maximizing the ratio of the between-class variance to the within-class
variance of a given data set. Thus, a feature that is scattered more among different
classes and less scattered within a class is important for classification. A text
classification method based on LDA and SVM has been proposed in which the high-
dimensional feature vector is transformed into a lower dimensional feature vector by the
LDA feature reduction technique, and an SVM classifier is then used for text
classification [15].
Independent Component Analysis (ICA) [16], on the other hand, identifies independent
components. ICA transforms the original high-dimensional data into lower dimensional
components that are maximally independent from each other. These independent components
are not necessarily orthogonal to each other as in PCA. For dimension reduction, ICA
finds k components that effectively capture the maximum variability of the original data.
2.4 Classifier Models
There has been active research in text classification over the past few years. Most
of the research work in text classification has focused on applying machine learning
methods to classify text based on words from a training set [1, 18, 19]. These
approaches include Naïve Bayes (NB) classifiers, SVM, K-Nearest Neighbour (KNN),
Decision Tree, the Rocchio algorithm, etc., as well as combinations of these approaches.
The Naïve Bayes classifier assumes independence among attributes. The NB approach is
simple to implement and its learning time is short; however, its performance is not good
for categories defined by very few features [21, 25]. It gives good classification
results for a text document provided there are a sufficient number of training
instances of each category. Gini index-based feature weighting has been combined with the
NB classifier, and this approach improved the performance of text classification [10].
A Bayesian classifier has been modified to handle one hundred thousand variables;
experimental results show that this modified tree-like Bayesian classifier works with
sufficient speed and accuracy [2]. Maximum entropy is used for a new text classifier
proposed in [8], resulting in better performance compared to the Bayes classifier.
SVM produces good results for two-class classification problems, such as whether a text
document belongs to a particular category or not, but it is difficult to extend to
multiclass classification. To solve multiclass SVM problems more efficiently, a class-
incremental approach is proposed in [23]. SVM outperforms KNN and the naïve Bayesian
classifier for text classification, as shown in [28]. The naïve Bayesian method has also
been used as a preprocessor for dimensionality reduction, followed by the SVM method for
text classification [5].
A modified k-NN-based text classification method has been proposed, in which variants of
the k-NN method with different decision functions, k values, and feature sets were
evaluated to increase the performance of the algorithm [9]. An improved k-NN algorithm
is proposed in which unimportant documents are not considered, to increase the
performance of the classification [13].
Decision Tree-based text classification does not assume independence among features, as
Naïve Bayes does. A decision tree performs well as a text classifier when there is a
small number of features; however, it becomes difficult to build a classifier for a large
number of features [19].
In the Rocchio algorithm, a text is represented as an N-dimensional vector, where N is
the total number of features and each feature is weighted by the TF-IDF scheme. Each
training text is expressed as a feature vector, and a prototype vector is then generated
for each class. At classification time, the similarity between the feature vectors of the
different classes and the feature vector of the unknown text document is calculated, and
the text is assigned to the class with the highest similarity [19].
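A minimal sketch of this idea follows, using mean TF-IDF vectors as class prototypes and
cosine similarity for the assignment; scikit-learn and NumPy are assumed, and the corpus,
labels, and similarity choice are illustrative assumptions rather than the original
Rocchio formulation:

# Rocchio-style prototype classifier sketch (toy data; scikit-learn and NumPy assumed).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "rates rise as inflation accelerates",
    "the central bank cut interest rates",
    "the champions won the league title",
    "a late goal decided the title race",
]
labels = ["finance", "finance", "sports", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# Prototype vector of each class = mean of its training document vectors.
prototypes = {c: X[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
              for c in set(labels)}

def classify(text):
    x = vec.transform([text]).toarray()[0]
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(prototypes, key=lambda c: cosine(x, prototypes[c]))

print(classify("the bank raised rates again"))   # expected: 'finance'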
Boosting and Bagging are two voting-based classifiers. In a voting classifier, training
samples are drawn randomly from the collection multiple times, and different classifiers
are learned. To classify a new sample, each classifier gives a class label, and the
result of the voting classifier is decided by the maximum number of votes earned for a
particular class [29]. The main difference between bagging and boosting is the way they
draw the samples for training a classifier. In bagging, training samples are drawn
randomly with equal weights, whereas in boosting, more weight is given to those samples
which have been misclassified by previous classifiers. AdaBoost, a boosting classifier,
outperforms Rocchio when the training dataset contains a very large number of relevant
documents [20].
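The reweighting idea behind boosting can be sketched with AdaBoost as follows, again
assuming scikit-learn; the corpus, labels, and test sentence are hypothetical:

# AdaBoost for text classification sketch (scikit-learn assumed; toy data is hypothetical).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

docs = [
    "quarterly profits beat analyst expectations",
    "the stock plunged after the earnings miss",
    "the midfielder signed a new contract",
    "the coach praised the defence after the win",
]
labels = ["finance", "finance", "sports", "sports"]

# Each round, AdaBoost increases the weight of misclassified training samples,
# so later weak learners (shallow decision trees by default) focus on the hard cases.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("boost", AdaBoostClassifier(n_estimators=25)),
])
model.fit(docs, labels)
print(model.predict(["profits beat expectations again"]))  # e.g. ['finance']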
In a neural network-based classifier, the feature vector is fed to the inputs of the
neural network, and the classification result comes from the output of the network. The
problem with the neural network is its slow learning. The performance of neural network-
based text classification has been improved by assigning the probabilities derived from
the naïve Bayesian method as initial weights [24]. In [27], three neural networks, i.e.,
(i) the Competitive network, (ii) Back Propagation (BP), and (iii) the Radial Basis
Function (RBF) network, are compared for text classification. The competitive network is
unsupervised, while BP and RBF are supervised learning methods. Experimental results show
that BP works effectively for text classification, that the RBF network learns faster
compared to the others, and that BP and RBF perform better than the competitive network.
A modified back propagation neural network has been proposed to improve the performance
of the traditional algorithm: the SVD technique is used for reducing the dimension of the
feature vector, and experimental results show that the modified neural network
outperforms the traditional back propagation NN [26].
There is a need to experiment with more such hybrid techniques in order to derive the
maximum benefit from machine learning algorithms and to achieve better classification
results. Different feature selection and reduction techniques can be used in combination
with different machine learning algorithms to increase the performance and accuracy of
the classifier.
3 Conclusion
The commercial importance of automatic text classification applications has increased
with the number of blogs, the amount of Web content, and the growth rate of Internet
access. Therefore, much research is currently focused on this area. The performance of
text classification can be increased using machine learning techniques. However,
preprocessing plays an important role due to the high-dimensional data, and feature
selection and reduction techniques enhance the quality of the training data for the
classifier, resulting in improved classifier accuracy.
Text classification for regional language documents can be useful for several govern-
mental and commercial projects. Multitopic text classification, identifying the
contextual use of terms on blogs, and the use of semantics for better classifiers are
some of the areas where future research can be done.
References
1. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1),
1–47 (2002)
2. Al-Harbi, S., Almuhareb, A., Al-Thubaity, A., Khorsheed, M., Al-Rajeh, A.: Automatic Arabic
text classification. In: JADT’08, France, pp. 77–83 (2008)
3. Forman, G.: An extensive empirical study of feature selection metrics for text classifica-
tion. J. Mach. Learn. Res. 3, 1289–1305 (2003)
4. Yang, Y., Pedersen, J.O.: A Comparative study on feature selection in text categorization. In:
Proceedings of the Fourteenth International Conference on Machine Learning, pp. 412–420,
08–12 July 1997
5. Isa, D., Lee, L.H., Kallimani, V.P., RajKumar, R.: Text document pre-processing with the Bayes
formula for classification using the support vector machine. IEEE Trans. Knowl. Data Eng.
20(9), 1264–1272 (2008)
6. Yan, X., Gareth, J., Li, J.T., Wang, B., Sun, C.M.: A study on mutual information-based feature
selection for text categorization. J. Comput. Inf. Syst. 3(3), 1007–1012 (2007)
7. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
8. Nigam, K., Mccallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unla-
beled documents using EM. Mach. Learn. 39, 103–134 (2000)
9. Joachims, T.: A statistical learning model of text classification for support vector machines. In:
24th ACM International Conference on Research and Development in Information Retrieval
(SIGIR) (2001)
10. Dong, T., Shang, W., Zhu, H.: An improved algorithm of Bayesian text categoriza-
tion. J. Softw. 6(9), 1837–1843 (2011)
11. Kumar, C.A.: Analysis of unsupervised dimensionality reduction techniques. Comput. Sci. Inf.
Syst. 6(2), 217–227 (2009)
12. Soon, C.P.: Neural network for text classification based on singular value decomposition. In:
7th International Conference on Computer and Information Technology, pp. 47–52 (2007)
13. Muhammed, M.: Improved k-NN algorithm for text classification. Department of Computer
Science and Engineering University of Texas at Arlington, TX, USA
14. Ikonomakis, M., Kotsiantis, S., Tampakas, V.: Text classification using machine learning tech-
niques. IEEE Trans. Comput. 4(8), 966–974 (2005)
15. Wang, Z., Qian, X.: Text categorization based on LDA and SVM. In: 2008 International
Conference on Computer Science and Software Engineering, vol. 1, pp. 674–677 (2008)
16. Kolenda, T., Hansen, L.K., Sigurdsson, S.: Independent components in text. In: Girolami, M.
(ed.) Advances in Independent Component Analysis, Springer-Verlag, New York (2000)
17. Jia-ni, H.U., Wei-Ran, X.U., Jun, G., Wei-Hong, D.: Study on feature methods in Chinese text
categorization. Study Opt. Commun. 3, 44–46 (2005)
18. Aggarwal, C.C., Zhai, C-X.: A survey of text classification algorithms. Mining Text Data. pp.
163–222, Springer (2012)
19. Aas, K., Eikvil, L.: Text categorisation: a survey. Tech. Rep. 941, Norwegian Computing
Center, Oslo, Norway (1999)
20. Schapire, R.E., Singer, Y., Singhal, A.: Boosting and Rocchio applied to text filtering. In:
Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development
in Information Retrieval, pp. 215–223. ACM Press, New York, USA (1998)
21. Kim, S.B., Rim, H.C., Yook, D.S., Lim, H.S.: Effective Methods for Improving Naive Bayes
Text Classifiers. LNAI 2417, 414–423 (2002)
22. Saeys, Y., Inza, I., Larranaga, P.: A review of feature selection techniques in bioinformatics.
Bioinformatics 23(19), 2507–2517 (2007)
23. Zhang, B., Su, J., Xu, X.: A class-incremental learning method for multi-class support vector
machines in text classification. In: Proceedings of the 5th IEEE International Conference on
Machine Learning and Cybernetics, pp. 2581–2585 (2006)
24. Goyal, R.D.: Knowledge based neural network for text classification. In: Proceedings of the
IEEE International Conference on Granular Computing, pp. 542–547 (2007)
25. Meena, M.J., Chandran, K.R.: Naïve Bayes text classification with positive features selected
by statistical method. In: Proceedings of the IEEE International Conference on Advanced
Computing, pp. 28–33 (2009)
26. Li, C.H., Park, S.C.: An efficient document classification model using an improved back prop-
agation neural network and singular value decomposition. Expert Syst. Appl. 36(2),
3208–3215 (2009)
27. Wang, Z., He, Y., Jiang, M.: A comparison among three neural networks for text classification.
In: 8th IEEE International Conference on Signal Processing (2006)
28. Zhijie, L., Lv, X., Liu, K., Shi, S.: Study on SVM compared with other text classification
methods. In: 2nd International Workshop on Education Technology and Computer Science (2010)
29. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Proceedings of the
13th International Conference on Machine Learning, pp. 148–156 (1996)