
Third International Symposium on Information Processing

Data imbalance problem in text classification

Yanling Li, Guoshe Sun
Xi'an Research Institute of Hi-technology
Xi'an, China
[email protected]

Yehang Zhu
School of Management Engineering, Xi'an Institute of Posts and Telecommunications
Xi'an, China

Abstract—Addressing the ever-present problem of imbalanced data in text classification, the authors study several forms of data imbalance, such as text number, class size, subclasses, and class overlap. Some useful conclusions are drawn from a series of related experiments: first, when two classes contain almost the same number of texts, the difference in word numbers becomes the major factor affecting classification accuracy; second, improving classification accuracy by increasing the size of the small class is of limited use; third, in the case of unbalanced data, the same words appearing in both classes often carry strong class information, that is, class overlap will not necessarily reduce the classification accuracy.

Keywords—text classification; imbalanced data; data distribution; class size; class overlap

I. INTRODUCTION

Imbalance of training data generally means that some classes have many samples while others have relatively few or very few [1]. In this case, standard classifiers tend to over-fit the big classes and to ignore the small ones. The data imbalance problem is ubiquitous in many application areas of data mining and machine learning [2]. For example, in the monitoring of public opinion and in information-security supervision, the number of texts holding a positive view is usually the majority, while the number of texts expressing negative or subversive ideas is small. As further examples, intrusion detection, medical diagnosis, and risk management applications all face unbalanced data. Some of the most commonly used classification algorithms, such as decision trees, support vector machines (SVM), KNN, neural networks, Bayesian networks, and association-rule mining methods, do not handle unbalanced data well [3, 4]. For example, reference [3] verified, from both theoretical and experimental points of view, the impact of the class distribution on traditional SVM classification: prediction accuracy on the small class is worse than on the big class. It is worth noting, however, that in practical applications the requirement to classify the small class correctly is often even higher. In filtering harmful information, for instance, harmful texts are far fewer than normal ones, yet the goal of filtering is precisely to identify and remove them. For these reasons, the data imbalance problem is an urgent one in the fields of data mining and machine learning.

The data imbalance problem was raised more than ten years ago. More and more researchers have found that the data they face are uneven and that the ideal classification performance cannot be achieved, so the problem has gradually received more attention. In particular, workshops at the AAAI and ICML conferences, organized in 2000 and 2003 respectively, raised awareness of the data imbalance problem [2].

This paper first analyzes several forms of data imbalance, taking two-class classification as the application background. The impact on classification accuracy of factors such as data distribution, class size, and class overlap is studied and, combined with related experiments, some important conclusions are drawn.

II. DATA IMBALANCE

It is generally considered that data imbalance is mainly embodied in differences in the number of samples between categories. According to statistics, in practice the sample-size ratio between big and small categories can be 1:100, 1:1000, or even 1:10000 [2]. Many studies show that such a large difference in sample numbers results in lower classification performance. In addition, when the text numbers of the various categories are roughly the same, does category imbalance still exist? From both theoretical and experimental perspectives, references [5, 6] verify that data skew is not the only factor affecting classification performance: the sample size of the small category, the independence of samples, and the existence of subclasses within a category all affect classification performance. That is, data imbalance is reflected not only in differences in the text numbers of different classes but also in class size, sub-categories, class overlap, and so on.

A. Data distribution imbalance

Data distribution imbalance mainly refers to different classes containing different numbers of texts. When analyzing the influence of data imbalance on classification, most existing studies focus on this factor, and class distribution imbalance is generally expressed as the ratio of text numbers between the small and big classes. A large number of studies show that existing classification models achieve better results when the category distribution is balanced and are affected to varying degrees when it is not. But there is no clear conclusion about how an unbalanced class distribution affects classification.

Reference [6] has shown that in some applications the classification results are not affected until the ratio of text numbers between the small and big classes reaches 1:35, while in other applications the results are already significantly affected at a ratio of 1:10. This study suggests that the difference in text numbers is only a surface symptom of uneven data distribution; there should be a deeper reason.

B. Class size

Class size means the number of texts in one class. Yanmin Sun et al. [2] argued that when the degree of class imbalance is fixed, class size becomes a major factor affecting classification results. The main reason is that when the class size is limited, classification rules cannot cover some inherent properties of the small category and are therefore unreliable. From experimental observation, G. E. A. P. A. Batista et al. [4] pointed out that as class size grows, misclassification caused by class distribution imbalance is reduced. If the data set is large enough, and assuming the learning time is acceptable, class distribution imbalance is no longer an obstacle to classification.

C. Class overlap

Class overlap means that the class concepts overlap. Reference [7] concluded that class distribution imbalance is not a problem in itself; only when the overlap between categories is large does it reduce the correct classification rate of the small category. Reference [5] obtained similar results: a linear classification model is not sensitive to uneven class distribution, but as the level of class overlap increases, so does the sensitivity of the classification system to class distribution imbalance.

D. Sub-category problem

In many practical classification problems, one class may contain several sub-classes. Typically, the sub-classes contain different numbers of texts, which in itself constitutes a class imbalance problem. In addition, the distinction between sub-classes is generally not significant, that is, there is class overlap. Reference [2] argues that these factors increase the complexity of training a classification model.

III. IMPACT ANALYSIS OF DATA IMBALANCE ON CLASSIFICATION

A. Classification method

This paper adopts the classification method based on the class space model proposed in our earlier work [8]. Its basic idea is to score the text to be classified according to the classification weights of its words; based on the text's score for each class, the text is assigned to the class with the highest score.
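To make the scoring step concrete, here is a minimal sketch assuming a toy weight table: how the per-class word weights are actually computed is defined by the class space model in [8] and is not reproduced here, so the weights and tokens below are purely hypothetical.

```python
# Minimal sketch of score-based assignment: each feature word carries a
# per-class weight, a text accumulates the weights of its words, and the
# text is assigned to the class with the highest total score. The weight
# table is a hypothetical stand-in for the class space model of [8].
from collections import defaultdict

def classify(tokens, word_weights, classes=("C1", "C2")):
    """tokens: feature words of a text; word_weights: word -> {class: weight}."""
    scores = defaultdict(float)
    for word in tokens:
        for cls, weight in word_weights.get(word, {}).items():
            scores[cls] += weight
    return max(classes, key=lambda c: scores[c])

# Hypothetical weights: "expensive" leans toward C1, "affordable" toward C2.
weights = {"expensive": {"C1": 0.9, "C2": 0.1},
           "affordable": {"C1": 0.2, "C2": 0.8}}
print(classify(["expensive", "expensive", "affordable"], weights))  # -> C1
```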

B. Experimental data sets

The experimental data used in this paper are posts downloaded from the "Strong Nation" forum of People's Daily Online, retrieved with the search keyword "房价" (house price). Posts were downloaded twice, about one month apart. According to the views expressed, the posts fall into two categories: C1, posts holding that house prices are high, and C2, posts holding that house prices are not high. To obtain an open test, the data sets were divided by time: the 829 posts downloaded first were used as the training set and the 81 posts downloaded later as the test set, as shown in Table I.

TABLE I. EXPERIMENT DATA SETS

  class    Training Set    Test Set
  C1       718             42
  C2       111             39

C. Classification performance index

Classification performance was measured by three common assessment indexes: precision (P), recall (R), and the F1 value [9, 10]. For the overall classification result, macro-averaging was adopted: the mean over the per-class results gives macro-P, macro-R, and macro-F1.
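For reference, the standard definitions behind these indexes are given below; TP, FP, and FN denote the true positives, false positives, and false negatives of a class, and |C| is the number of classes. The macro-F1 form shown (the harmonic mean of macro-P and macro-R) is consistent with the values reported later: for example, for DF in Table IV, 2 · 86.86 · 86.63 / (86.86 + 86.63) ≈ 86.74, matching the reported macro-F1.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}

\text{macro-}P = \frac{1}{|C|} \sum_{i=1}^{|C|} P_i, \qquad
\text{macro-}R = \frac{1}{|C|} \sum_{i=1}^{|C|} R_i, \qquad
\text{macro-}F_1 = \frac{2\,\text{macro-}P \cdot \text{macro-}R}{\text{macro-}P + \text{macro-}R}
```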
D. Experimental results and analysis

1) Influence of data distribution imbalance on classification

a) Experiment

Experiment 1: classification on the initial unbalanced data sets. The results are shown in Table II.

TABLE II. CLASSIFICATION RESULTS BASED ON UNBALANCED DATA

  class    P(%)     R(%)     F1(%)
  C1       60.87    100      75.68
  C2       100      30.77    47.06

Experiment 2: 111 texts were randomly selected from the initial training set C1, so that both classes had 111 training texts. To reduce the influence of chance, 5 experiments were made in total, each time drawing a fresh random sample of 111 texts from C1. The results are shown in Fig. 1; a sketch of the sampling step follows.
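A minimal sketch of this undersampling step, with placeholder strings standing in for the 718 real C1 posts:

```python
# Experiment 2: draw 111 texts (the size of C2) from the big class C1
# without replacement, repeating the draw 5 times with different seeds.
import random

c1_train = [f"c1_post_{i}" for i in range(718)]  # placeholder C1 texts
for seed in range(5):
    sample = random.Random(seed).sample(c1_train, 111)
    # ... train and evaluate on `sample` plus all 111 C2 training texts
    print(seed, len(sample))
```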
Figure 1. Classification results of the 5 runs with 111 texts randomly selected from class C1

b) Experimental analysis

• From the results of Experiment 1, classification based on the initial imbalanced training set is poor: the recall of class C1 reached 100% and the precision of class C2 reached 100%, which indicates that all texts of class C1 were correctly classified, while the precision of class C1 and the recall of class C2 were low because many texts of class C2 were wrongly assigned to class C1. The classifier obviously prefers the big class C1.

• From Fig. 1, when the two classes contain the same number of texts, the classification accuracy improves significantly. However, further analysis shows that the results of the two classes still differ, as shown in Fig. 2: the recall of class C1 is generally higher and its precision generally lower. The high recall shows that few samples of class C1 were assigned to class C2, that is, most texts of class C1 were correctly classified; the precision is lower because many samples of class C2 were wrongly assigned to class C1, which depresses the precision of class C1. So the classifier still prefers class C1. If the two classes now contain the same number of texts, why does the classifier still prefer C1? Careful analysis shows that the texts of class C1 are generally longer. With the feature-extraction algorithms in common use today, differences in text length clearly affect the number of feature words extracted, so class C1 yields more feature words than class C2, as shown in Fig. 3. A difference in the number of feature words generally affects classification accuracy directly, because learning algorithms usually exploit three statistical characteristics of feature words: word frequency, document frequency, and word distribution, and imbalances in text number and word frequency across classes clearly distort all three. Take the common feature-weighting method tf*idf: text length directly affects tf(t,d) and the number of texts directly affects idf(t), so even when the two classes contain almost the same number of texts, differences in text length change the computed feature-word weights and hence the classification accuracy (see the tf*idf sketch after this list).

Figure 2. Comparison of the classification results of the two classes: (a) precision; (b) recall

Figure 3. Comparison of the feature-word numbers of the two classes

• Among the 5 experiments, the classification accuracy of group 4 is the highest and that of group 3 the second highest. As shown in Fig. 3, the feature-word numbers of the two classes are closer to each other in groups 3 and 4 than in the other groups, yet the gap in group 4 is slightly larger than in group 3. This indicates that closer feature-word numbers are not simply better; they only need to be close within a certain range. Why is closer not always better? Analysis of the causes shows that in the initial training collection the difference in sample distribution between the two classes is mainly embodied as follows: class C1 has more texts, its texts are relatively long, and its vocabulary is relatively divergent; class C2 has fewer texts, its texts are relatively short, and its vocabulary is relatively concentrated. The data come from the real world, where a certain difference between the text numbers and text lengths of the two classes genuinely exists; blindly forcing the two classes to have the same number of feature words would destroy some of the distributional features of the original data.

• Summarizing the experimental results and analysis above from the point of view of the training-set distribution: the difference in text numbers is clearly not the only factor causing data imbalance. When the two classes contain about the same number of texts, differences in text length cause the classes to yield different numbers of feature words, which significantly affects the classification result. How to resolve the distribution imbalance caused by differences in text length needs further study; we believe the classification algorithm used, the original distributional features of the data set, and other factors must be considered.
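The tf*idf sketch referenced in the first bullet above: a minimal illustration of the length effect, assuming one common tf*idf variant (raw term count for tf, log(N/df) for idf); the paper does not state exactly which variant it uses.

```python
# Why text length skews tf*idf: a longer document repeats a term more
# often, inflating tf(t,d) even when the topic is the same, while
# idf(t) depends on how many documents contain t.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                      # raw term frequency
    df = sum(1 for d in corpus if term in d)  # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

short = ["price", "high"]
longer = ["price", "high", "price", "rising", "price"]
corpus = [short, longer, ["rent", "low"]]
print(tf_idf("price", short, corpus))   # tf = 1 -> weight ~ 0.41
print(tf_idf("price", longer, corpus))  # tf = 3 -> weight ~ 1.22
```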
2) The impact of small-class size on classification

G. E. A. P. A. Batista et al. [4] pointed out that, given enough learning time, classification results are no longer affected by class distribution imbalance as long as the class size is increased. But making the class size large enough is often not feasible in practice: the number of samples in some small classes is limited, more are difficult to obtain, and the ideal size cannot be reached. If small-class samples are generated artificially, or the small class is enlarged by random copying and similar methods, the result may be over-fitting to the small class, which harms classification accuracy.

a) Experiment

By randomly replicating samples of class C2, the size of class C2 was expanded. A total of five experiments were done, with the text number of class C2 set to 222, 360, 700, 718, and 730 respectively. The experiment results are shown in Fig. 4; a sketch of the replication step follows.
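A minimal sketch of this random-replication (oversampling) step, with placeholder strings standing in for the 111 real C2 training texts:

```python
# Grow the small class C2 to a target size by sampling its own texts
# with replacement, matching the five experimental targets.
import random

def oversample(docs, target_size, seed=0):
    rng = random.Random(seed)
    extra = [rng.choice(docs) for _ in range(target_size - len(docs))]
    return docs + extra

c2_train = [f"c2_post_{i}" for i in range(111)]  # placeholder C2 texts
for target in (222, 360, 700, 718, 730):
    print(target, len(oversample(c2_train, target)))
```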
b) Experimental analysis

As the size of the small class C2 was gradually increased in the above experiments, the overall performance of the classifier gradually improved, but past a certain point the classification performance no longer changed. At that point, as the results in Fig. 4 show, the recall of class C1 was still significantly higher than that of class C2, and the precision of class C2 was higher than that of class C1, indicating that the classifier's preference for the big class C1 had not been effectively removed.

The comprehensive analysis above shows that, when facing unbalanced data, class size influences classification accuracy only within a certain range. Expanding the small class by random replication can improve classifier performance, but only to a limited extent, and it also increases training time, while obtaining real samples remains difficult. Therefore, improving classifier performance through unrestricted expansion of the class size is generally not feasible.

Figure 4. Classification effect of increasing the small-class size: (a) precision comparison of the two classes; (b) recall comparison of the two classes
3) Influence of class overlap on classification

In the classification process, a text written in natural language is first expressed in a formal form the computer can handle, generally a vector composed of feature words. Concept overlap between different classes, which exists in the objective real world, is therefore reflected in the classification process as a larger number of identical words appearing in different classes, that is, more identical feature words belonging to different classes.

Related research has demonstrated that an increase in the degree of class overlap is the real reason for degraded classification performance [5, 7]; if the degree of class overlap is very low, an uneven class distribution does not affect classification. To examine this question, this paper carried out related experiments.
a) Experiment

Experiment 1 was divided into four groups: group 1, no feature selection; group 2, remove all words appearing in both classes; group 3, remove the very high-frequency words appearing in both classes; group 4, remove the low-frequency words appearing in both classes. Word frequency was measured by document frequency (DF) [9]; a word with DF less than two was regarded as a low-frequency word. The experiment results are shown in Table III, and a sketch of the four filtering groups follows the table.

TABLE III. EXPERIMENT RESULT OF THE IMPACT OF SAME TERMS IN BOTH CLASSES ON THE CLASSIFICATION

  Macro-value    1       2       3       4
  Macro-P(%)     82.53   73.12   81.44   81.44
  Macro-R(%)     81.14   68.32   78.48   78.48
  Macro-F1(%)    81.83   70.64   79.93   79.93
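The sketch below illustrates the four filtering groups. The DF < 2 cutoff for low-frequency words is taken from the paper; the cutoff high_df separating "very high frequency" words is not given in the paper and is a placeholder, as is the reading that groups 3 and 4 filter only words shared by both classes.

```python
# The four feature-selection groups of Experiment 1, over tokenized
# documents of each class. "Shared" words appear in both classes and
# are the class-overlap words the experiment manipulates.
from collections import Counter

def doc_freq(docs):
    """Document frequency: in how many documents each word appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def select_features(c1_docs, c2_docs, group, high_df=50):
    vocab1 = {w for d in c1_docs for w in d}
    vocab2 = {w for d in c2_docs for w in d}
    shared = vocab1 & vocab2              # words appearing in both classes
    df = doc_freq(c1_docs + c2_docs)
    if group == 1:                        # group 1: no feature selection
        removed = set()
    elif group == 2:                      # group 2: all shared words
        removed = shared
    elif group == 3:                      # group 3: very high-frequency shared words
        removed = {w for w in shared if df[w] >= high_df}
    else:                                 # group 4: low-frequency shared words (DF < 2)
        removed = {w for w in shared if df[w] < 2}
    return (vocab1 | vocab2) - removed
```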
Experiment 2: experiments were run with the commonly used feature selection methods DF, MI, CHI, and IG [9-11]. For each method a number of experiments were made and the best classification result was selected, as shown in Table IV.

b) Experimental analysis

• The experimental data come from the internet and are divided into two categories according to the point of view about house prices. Both sides express opinions on the same theme, house prices, so the degree of concept overlap between the classes is high, that is, many identical words appear in both categories.

• As shown in Table III, removing the words shared by both classes, that is, reducing the degree of class overlap, lowers the classification precision. Analysis of the causes shows that the experimental data are unevenly distributed over the two classes: since the text numbers and language features of the two classes differ, a shared word's statistical properties differ between the two classes, and the class information it carries is not the same, which actually makes the classes easier to distinguish. These shared words therefore carry strong class information.

• As shown in Table IV, for the DF, MI, and IG methods, whose classification results are better, the proportion of shared words in the final feature set after selection is higher, more than 90% of the total number of words; for the CHI method, whose accuracy is lower, the proportion is the lowest, only 33.79% of the total. This result further confirms the conclusion of Experiment 1: with imbalanced data, some shared words carry strong class information, that is, class overlap is not, in this situation, the main factor affecting classification accuracy.

TABLE IV. EXPERIMENT RESULT OF THE IMPACT OF SAME TERMS IN BOTH CLASSES ON THE CLASSIFICATION

  Methods   Macro-P(%)   Macro-R(%)   Macro-F1(%)   Total Number of Words   Number of Same Words
  DF        86.86        86.63        86.74         282                     251
  MI        86.4         86.45        85.42         541                     541
  CHI       84.55        83.7         84.12         1598                    540
  IG        88.93        88.83        88.87         512                     512

IV. CONCLUSION

The data imbalance problem often appears in the field of text classification. This paper analyzes several forms of data imbalance, including text distribution, class size, and class overlap, and through a series of experiments obtains a number of important conclusions of practical value.

REFERENCES

[1] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: Special issue on learning from imbalanced data sets," SIGKDD Explorations Newsletter, 2004, Vol. 6, No. 1, pp. 1-6.
[2] Yanmin Sun, M. S. Kamel, et al., "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, 2007, No. 40, pp. 3358-3378.
[3] Enhui Zheng, Hong Xu, Ping Li, et al., "Mining knowledge from unbalanced data based on v-support vector machine," Journal of Zhejiang University (Engineering Science), 2006, Vol. 40, No. 10, pp. 1682-1687.
[4] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," SIGKDD Explorations Special Issue on Learning from Imbalanced Datasets, 2004, Vol. 6, No. 1, pp. 20-29.
[5] G. Weiss, "Mining with rarity: a unifying framework," SIGKDD Explorations Special Issue on Learning from Imbalanced Datasets, 2004, Vol. 6, No. 1, pp. 7-19.
[6] M. V. Joshi, "Learning classifier models for predicting rare phenomena," Ph.D. Thesis, University of Minnesota, 2002.
[7] R. C. Prati and G. E. A. P. A. Batista, "Class imbalances versus class overlapping: an analysis of a learning system behavior," in Proceedings of the Mexican International Conference on Artificial Intelligence (MICAI), 2004, pp. 312-321.
[8] Yanling Li, Guanzhong Dai, and Sen Qin, "Text tendency categorization method based on class space model," Computer Application, 2007, Vol. 27, No. 9, pp. 2194-2196.
[9] Liuling Dai, Heyan Huang, and Zhaoxiong Chen, "A comparative study on feature selection in Chinese text categorization," Journal of Chinese Information Processing, 2004, Vol. 18, No. 1, pp. 26-32.
[10] Hailong Zhang and Lianzhi Wang, "Automatic text categorization feature selection methods research," Computer Engineering and Design, 2006, Vol. 27, No. 20, pp. 383-3841.
[11] Tao Chen and Yangqun Xie, "Literature review of feature dimension reduction in text categorization," Journal of the China Society for Scientific and Technical Information, 2005, Vol. 24, No. 6, pp. 691-695.
