
Third International Symposium on Information Processing

Data imbalance problem in text classification

Yanling Li, Guoshe Sun
Xi'an Research Institute of Hi-technology
Xi'an, China
[email protected]

Yehang Zhu
School of Management Engineering, Xi'an Institute of Posts and Telecommunications
Xi'an, China

Abstract—Addressing the ever-present problem of imbalanced data in text classification, the authors study several forms of data imbalance, such as text number, class size, subclasses, and class overlap. Some useful conclusions are drawn from a series of related experiments: first, when two classes contain almost the same number of texts, the difference in word numbers becomes the major factor affecting classification accuracy; second, improving classification accuracy by increasing the size of the small class is of limited use; third, in the case of unbalanced data, the same words appearing in both classes often carry strong class information, that is, class overlap will not necessarily reduce the classification accuracy.

Keywords—text classification; imbalanced data; data distribution; class size; class overlap

I. INTRODUCTION

Imbalance of training data generally means that some classes have many samples while others have relatively few or very few [1]. In this case, standard classifiers tend to over-fit the big classes and to ignore the small ones. The data imbalance problem is ubiquitous in many application areas of data mining and machine learning [2]. For example, in the monitoring of public opinion and in information-security supervision, the number of texts holding a positive view is usually the majority, while the number of texts expressing negative or subversive ideas is small. As further examples, intrusion detection, medical diagnosis, and risk management applications all face unbalanced data. Some of the most commonly used classification algorithms, such as decision trees, support vector machines (SVM), KNN, neural networks, Bayesian networks, and association-rule mining methods, do not handle unbalanced data well [3, 4]. For example, reference [3] verified, from both theoretical and experimental points of view, the impact of the class distribution on traditional SVM classification: prediction accuracy on the small class is worse than on the big class. It is worth noting, however, that in practical applications the requirement to classify the small class correctly is often even higher. In filtering harmful information, for instance, harmful texts are far fewer than normal ones, yet the goal of filtering is precisely to identify and remove them. For these reasons, the data imbalance problem is an urgent one in the fields of data mining and machine learning.

The data imbalance problem was raised more than ten years ago. More and more researchers have found that the data they face are uneven and that the ideal classification performance cannot be achieved, so the problem has gradually received more attention. In particular, workshops at the AAAI and ICML conferences, organized in 2000 and 2003 respectively, raised awareness of the data imbalance problem [2].

This paper first analyzes several forms of data imbalance, taking two-class classification as the application background. The impact on classification accuracy of factors such as data distribution, class size, and class overlap is studied and, combined with related experiments, some important conclusions are drawn.

II. DATA IMBALANCE

It is generally considered that data imbalance is mainly embodied in differences in the number of samples between categories. According to statistics, in practice the sample-size ratio between big and small categories can be 1:100, 1:1000, or even 1:10000 [2]. Many studies show that such a large difference in sample numbers results in lower classification performance. In addition, when the text numbers of the various categories are roughly the same, does category imbalance still exist? From both theoretical and experimental perspectives, references [5, 6] verify that data skew is not the only factor affecting classification performance: the sample size of the small category, the independence of samples, and the existence of subclasses within a category all affect classification performance. That is, data imbalance is reflected not only in differences in the text numbers of different classes but also in class size, sub-categories, class overlap, and so on.

A. Data distribution imbalance

Data distribution imbalance mainly refers to different classes containing different numbers of texts. When analyzing the influence of data imbalance on classification, most existing studies focus on this factor, and class distribution imbalance is generally expressed as the ratio of text numbers between the small and big classes. A large number of studies show that existing classification models achieve better results when the category distribution is balanced and are affected to varying degrees when it is not. But there is no clear conclusion about how an unbalanced class distribution affects classification.

Reference [6] has shown that in some applications the classification results are not affected until the ratio of text numbers between the small and big classes reaches 1:35, while in other applications the results are already significantly affected at a ratio of 1:10. This study suggests that the difference in text numbers is only a surface symptom of uneven data distribution; there should be a deeper reason.

B. Class size

Class size means the number of texts in one class. Yanmin Sun et al. [2] argued that when the degree of class imbalance is fixed, class size becomes a major factor affecting classification results. The main reason is that when the class size is limited, classification rules cannot cover some inherent properties of the small category and are therefore unreliable. From experimental observation, G. E. A. P. A. Batista et al. [4] pointed out that as class size grows, misclassification caused by class distribution imbalance is reduced. If the data set is large enough, and assuming the learning time is acceptable, class distribution imbalance is no longer an obstacle to classification.

C. Class overlap

Class overlap means that the class concepts overlap. Reference [7] concluded that class distribution imbalance is not a problem in itself; only when the overlap between categories is large does it reduce the correct classification rate of the small category. Reference [5] obtained similar results: a linear classification model is not sensitive to uneven class distribution, but as the level of class overlap increases, so does the sensitivity of the classification system to class distribution imbalance.

D. Sub-category problem

In many practical classification problems, one class may contain several sub-classes. Typically, the sub-classes contain different numbers of texts, which in itself constitutes a class imbalance problem. In addition, the distinction between sub-classes is generally not significant, that is, there is class overlap. Reference [2] argues that these factors increase the complexity of training a classification model.

III. IMPACT ANALYSIS OF DATA IMBALANCE ON CLASSIFICATION

A. Classification method

This paper adopts the classification method based on the class space model proposed in our earlier work [8]. Its basic idea is to score the text to be classified according to the classification weights of its words; based on the text's score for each class, the text is assigned to the class with the highest score.
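To make the scoring step concrete, here is a minimal sketch assuming a toy weight table: how the per-class word weights are actually computed is defined by the class space model in [8] and is not reproduced here, so the weights and tokens below are purely hypothetical.

```python
# Minimal sketch of score-based assignment: each feature word carries a
# per-class weight, a text accumulates the weights of its words, and the
# text is assigned to the class with the highest total score. The weight
# table is a hypothetical stand-in for the class space model of [8].
from collections import defaultdict

def classify(tokens, word_weights, classes=("C1", "C2")):
    """tokens: feature words of a text; word_weights: word -> {class: weight}."""
    scores = defaultdict(float)
    for word in tokens:
        for cls, weight in word_weights.get(word, {}).items():
            scores[cls] += weight
    return max(classes, key=lambda c: scores[c])

# Hypothetical weights: "expensive" leans toward C1, "affordable" toward C2.
weights = {"expensive": {"C1": 0.9, "C2": 0.1},
           "affordable": {"C1": 0.2, "C2": 0.8}}
print(classify(["expensive", "expensive", "affordable"], weights))  # -> C1
```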

B. Experimental data sets

The experimental data used in this paper are posts downloaded from the "Strong Nation" forum of People's Daily Online, retrieved with the search keyword "房价" (house price). Posts were downloaded twice, about one month apart. According to the views expressed, the posts fall into two categories: C1, posts holding that house prices are high, and C2, posts holding that house prices are not high. To obtain an open test, the data sets were divided by time: the 829 posts downloaded first were used as the training set and the 81 posts downloaded later as the test set, as shown in Table I.

TABLE I. EXPERIMENT DATA SETS

  class    Training Set    Test Set
  C1       718             42
  C2       111             39

C. Classification performance index

Classification performance was measured by three common assessment indexes: precision (P), recall (R), and the F1 value [9, 10]. For the overall classification result, macro-averaging was adopted: the mean over the per-class results gives macro-P, macro-R, and macro-F1.
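For reference, the standard definitions behind these indexes are given below; TP, FP, and FN denote the true positives, false positives, and false negatives of a class, and |C| is the number of classes. The macro-F1 form shown (the harmonic mean of macro-P and macro-R) is consistent with the values reported later: for example, for DF in Table IV, 2 · 86.86 · 86.63 / (86.86 + 86.63) ≈ 86.74, matching the reported macro-F1.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}

\text{macro-}P = \frac{1}{|C|} \sum_{i=1}^{|C|} P_i, \qquad
\text{macro-}R = \frac{1}{|C|} \sum_{i=1}^{|C|} R_i, \qquad
\text{macro-}F_1 = \frac{2\,\text{macro-}P \cdot \text{macro-}R}{\text{macro-}P + \text{macro-}R}
```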
D. Experimental results and analysis

1) Influence of data distribution imbalance on classification

a) Experiment

Experiment 1: classification on the initial unbalanced data sets. The results are shown in Table II.

TABLE II. CLASSIFICATION RESULTS BASED ON UNBALANCED DATA

  class    P(%)     R(%)     F1(%)
  C1       60.87    100      75.68
  C2       100      30.77    47.06

Experiment 2: 111 texts were randomly selected from the initial training set C1, so that both classes had 111 training texts. To reduce the influence of chance, 5 experiments were made in total, each time drawing a fresh random sample of 111 texts from C1. The results are shown in Fig. 1; a sketch of the sampling step follows.
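A minimal sketch of this undersampling step, with placeholder strings standing in for the 718 real C1 posts:

```python
# Experiment 2: draw 111 texts (the size of C2) from the big class C1
# without replacement, repeating the draw 5 times with different seeds.
import random

c1_train = [f"c1_post_{i}" for i in range(718)]  # placeholder C1 texts
for seed in range(5):
    sample = random.Random(seed).sample(c1_train, 111)
    # ... train and evaluate on `sample` plus all 111 C2 training texts
    print(seed, len(sample))
```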
Figure 1. Classification results of the 5 runs with 111 texts randomly selected from class C1

b) Experimental analysis

• From the results of Experiment 1, classification based on the initial imbalanced training set is poor: the recall of class C1 reached 100% and the precision of class C2 reached 100%, which indicates that all texts of class C1 were correctly classified, while the precision of class C1 and the recall of class C2 were low because many texts of class C2 were wrongly assigned to class C1. The classifier obviously prefers the big class C1.

• From Fig. 1, when the two classes contain the same number of texts, the classification accuracy improves significantly. However, further analysis shows that the results of the two classes still differ, as shown in Fig. 2: the recall of class C1 is generally higher and its precision generally lower. The high recall shows that few samples of class C1 were assigned to class C2, that is, most texts of class C1 were correctly classified; the precision is lower because many samples of class C2 were wrongly assigned to class C1, which depresses the precision of class C1. So the classifier still prefers class C1. If the two classes now contain the same number of texts, why does the classifier still prefer C1? Careful analysis shows that the texts of class C1 are generally longer. With the feature-extraction algorithms in common use today, differences in text length clearly affect the number of feature words extracted, so class C1 yields more feature words than class C2, as shown in Fig. 3. A difference in the number of feature words generally affects classification accuracy directly, because learning algorithms usually exploit three statistical characteristics of feature words: word frequency, document frequency, and word distribution, and imbalances in text number and word frequency across classes clearly distort all three. Take the common feature-weighting method tf*idf: text length directly affects tf(t,d) and the number of texts directly affects idf(t), so even when the two classes contain almost the same number of texts, differences in text length change the computed feature-word weights and hence the classification accuracy (see the tf*idf sketch after this list).

Figure 2. Comparison of the classification results of the two classes: (a) precision; (b) recall

Figure 3. Comparison of the feature-word numbers of the two classes

• Among the 5 experiments, the classification accuracy of group 4 is the highest and that of group 3 the second highest. As shown in Fig. 3, the feature-word numbers of the two classes are closer to each other in groups 3 and 4 than in the other groups, yet the gap in group 4 is slightly larger than in group 3. This indicates that closer feature-word numbers are not simply better; they only need to be close within a certain range. Why is closer not always better? Analysis of the causes shows that in the initial training collection the difference in sample distribution between the two classes is mainly embodied as follows: class C1 has more texts, its texts are relatively long, and its vocabulary is relatively divergent; class C2 has fewer texts, its texts are relatively short, and its vocabulary is relatively concentrated. The data come from the real world, where a certain difference between the text numbers and text lengths of the two classes genuinely exists; blindly forcing the two classes to have the same number of feature words would destroy some of the distributional features of the original data.

• Summarizing the experimental results and analysis above from the point of view of the training-set distribution: the difference in text numbers is clearly not the only factor causing data imbalance. When the two classes contain about the same number of texts, differences in text length cause the classes to yield different numbers of feature words, which significantly affects the classification result. How to resolve the distribution imbalance caused by differences in text length needs further study; we believe the classification algorithm used, the original distributional features of the data set, and other factors must be considered.
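The tf*idf sketch referenced in the first bullet above: a minimal illustration of the length effect, assuming one common tf*idf variant (raw term count for tf, log(N/df) for idf); the paper does not state exactly which variant it uses.

```python
# Why text length skews tf*idf: a longer document repeats a term more
# often, inflating tf(t,d) even when the topic is the same, while
# idf(t) depends on how many documents contain t.
import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                      # raw term frequency
    df = sum(1 for d in corpus if term in d)  # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

short = ["price", "high"]
longer = ["price", "high", "price", "rising", "price"]
corpus = [short, longer, ["rent", "low"]]
print(tf_idf("price", short, corpus))   # tf = 1 -> weight ~ 0.41
print(tf_idf("price", longer, corpus))  # tf = 3 -> weight ~ 1.22
```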
2) The impact of small-class size on classification

G. E. A. P. A. Batista et al. [4] pointed out that, given enough learning time, classification results are no longer affected by class distribution imbalance as long as the class size is increased. But making the class size large enough is often not feasible in practice: the number of samples in some small classes is limited, more are difficult to obtain, and the ideal size cannot be reached. If small-class samples are generated artificially, or the small class is enlarged by random copying and similar methods, the result may be over-fitting to the small class, which harms classification accuracy.

a) Experiment

By randomly replicating samples of class C2, the size of class C2 was expanded. A total of five experiments were done, with the text number of class C2 set to 222, 360, 700, 718, and 730 respectively. The experiment results are shown in Fig. 4; a sketch of the replication step follows.
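A minimal sketch of this random-replication (oversampling) step, with placeholder strings standing in for the 111 real C2 training texts:

```python
# Grow the small class C2 to a target size by sampling its own texts
# with replacement, matching the five experimental targets.
import random

def oversample(docs, target_size, seed=0):
    rng = random.Random(seed)
    extra = [rng.choice(docs) for _ in range(target_size - len(docs))]
    return docs + extra

c2_train = [f"c2_post_{i}" for i in range(111)]  # placeholder C2 texts
for target in (222, 360, 700, 718, 730):
    print(target, len(oversample(c2_train, target)))
```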
b) Experimental analysis

As the size of the small class C2 was gradually increased in the above experiments, the overall performance of the classifier gradually improved, but past a certain point the classification performance no longer changed. At that point, as the results in Fig. 4 show, the recall of class C1 was still significantly higher than that of class C2, and the precision of class C2 was higher than that of class C1, indicating that the classifier's preference for the big class C1 had not been effectively removed.

The comprehensive analysis above shows that, when facing unbalanced data, class size influences classification accuracy only within a certain range. Expanding the small class by random replication can improve classifier performance, but only to a limited extent, and it also increases training time, while obtaining real samples remains difficult. Therefore, improving classifier performance through unrestricted expansion of the class size is generally not feasible.

Figure 4. Classification effect of increasing the small-class size: (a) precision comparison of the two classes; (b) recall comparison of the two classes
3) Influence of class overlap on classification

In the classification process, a text written in natural language is first expressed in a formal form the computer can handle, generally a vector composed of feature words. Concept overlap between different classes, which exists in the objective real world, is therefore reflected in the classification process as a larger number of identical words appearing in different classes, that is, more identical feature words belonging to different classes.

Related research has demonstrated that an increase in the degree of class overlap is the real reason for degraded classification performance [5, 7]; if the degree of class overlap is very low, an uneven class distribution does not affect classification. To examine this question, this paper carried out related experiments.
a) Experiment

Experiment 1 was divided into four groups: group 1, no feature selection; group 2, remove all words appearing in both classes; group 3, remove the very high-frequency words appearing in both classes; group 4, remove the low-frequency words appearing in both classes. Word frequency was measured by document frequency (DF) [9]; a word with DF less than two was regarded as a low-frequency word. The experiment results are shown in Table III, and a sketch of the four filtering groups follows the table.

TABLE III. EXPERIMENT RESULT OF THE IMPACT OF SAME TERMS IN BOTH CLASSES ON THE CLASSIFICATION

  Macro-value    1       2       3       4
  Macro-P(%)     82.53   73.12   81.44   81.44
  Macro-R(%)     81.14   68.32   78.48   78.48
  Macro-F1(%)    81.83   70.64   79.93   79.93
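The sketch below illustrates the four filtering groups. The DF < 2 cutoff for low-frequency words is taken from the paper; the cutoff high_df separating "very high frequency" words is not given in the paper and is a placeholder, as is the reading that groups 3 and 4 filter only words shared by both classes.

```python
# The four feature-selection groups of Experiment 1, over tokenized
# documents of each class. "Shared" words appear in both classes and
# are the class-overlap words the experiment manipulates.
from collections import Counter

def doc_freq(docs):
    """Document frequency: in how many documents each word appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def select_features(c1_docs, c2_docs, group, high_df=50):
    vocab1 = {w for d in c1_docs for w in d}
    vocab2 = {w for d in c2_docs for w in d}
    shared = vocab1 & vocab2              # words appearing in both classes
    df = doc_freq(c1_docs + c2_docs)
    if group == 1:                        # group 1: no feature selection
        removed = set()
    elif group == 2:                      # group 2: all shared words
        removed = shared
    elif group == 3:                      # group 3: very high-frequency shared words
        removed = {w for w in shared if df[w] >= high_df}
    else:                                 # group 4: low-frequency shared words (DF < 2)
        removed = {w for w in shared if df[w] < 2}
    return (vocab1 | vocab2) - removed
```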
Experiment 2: experiments were run with the commonly used feature selection methods DF, MI, CHI, and IG [9-11]. For each method a number of experiments were made and the best classification result was selected, as shown in Table IV.

b) Experimental analysis

• The experimental data come from the internet and are divided into two categories according to the point of view about house prices. Both sides express opinions on the same theme, house prices, so the degree of concept overlap between the classes is high, that is, many identical words appear in both categories.

• As shown in Table III, removing the words shared by both classes, that is, reducing the degree of class overlap, lowers the classification precision. Analysis of the causes shows that the experimental data are unevenly distributed over the two classes: since the text numbers and language features of the two classes differ, a shared word's statistical properties differ between the two classes, and the class information it carries is not the same, which actually makes the classes easier to distinguish. These shared words therefore carry strong class information.

• As shown in Table IV, for the DF, MI, and IG methods, whose classification results are better, the proportion of shared words in the final feature set after selection is higher, more than 90% of the total number of words; for the CHI method, whose accuracy is lower, the proportion is the lowest, only 33.79% of the total. This result further confirms the conclusion of Experiment 1: with imbalanced data, some shared words carry strong class information, that is, class overlap is not, in this situation, the main factor affecting classification accuracy.

TABLE IV. EXPERIMENT RESULT OF THE IMPACT OF SAME TERMS IN BOTH CLASSES ON THE CLASSIFICATION

  Methods   Macro-P(%)   Macro-R(%)   Macro-F1(%)   Total Number of Words   Number of Same Words
  DF        86.86        86.63        86.74         282                     251
  MI        86.4         86.45        85.42         541                     541
  CHI       84.55        83.7         84.12         1598                    540
  IG        88.93        88.83        88.87         512                     512

IV. CONCLUSION

The data imbalance problem often appears in the field of text classification. This paper analyzes several forms of data imbalance, including text distribution, class size, and class overlap, and through a series of experiments obtains a number of important conclusions of practical value.

REFERENCES

[1] N. V. Chawla, N. Japkowicz, and A. Kotcz, "Editorial: Special issue on learning from imbalanced data sets," SIGKDD Explorations Newsletter, 2004, Vol. 6, No. 1, pp. 1-6.
[2] Yanmin Sun, M. S. Kamel, et al., "Cost-sensitive boosting for classification of imbalanced data," Pattern Recognition, 2007, No. 40, pp. 3358-3378.
[3] Enhui Zheng, Hong Xu, Ping Li, et al., "Mining knowledge from unbalanced data based on v-support vector machine," Journal of Zhejiang University (Engineering Science), 2006, Vol. 40, No. 10, pp. 1682-1687.
[4] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," SIGKDD Explorations Special Issue on Learning from Imbalanced Datasets, 2004, Vol. 6, No. 1, pp. 20-29.
[5] G. Weiss, "Mining with rarity: a unifying framework," SIGKDD Explorations Special Issue on Learning from Imbalanced Datasets, 2004, Vol. 6, No. 1, pp. 7-19.
[6] M. V. Joshi, "Learning classifier models for predicting rare phenomena," Ph.D. Thesis, University of Minnesota, 2002.
[7] R. C. Prati and G. E. A. P. A. Batista, "Class imbalances versus class overlapping: an analysis of a learning system behavior," in Proceedings of the Mexican International Conference on Artificial Intelligence (MICAI), 2004, pp. 312-321.
[8] Yanling Li, Guanzhong Dai, and Sen Qin, "Text tendency categorization method based on class space model," Computer Application, 2007, Vol. 27, No. 9, pp. 2194-2196.
[9] Liuling Dai, Heyan Huang, and Zhaoxiong Chen, "A comparative study on feature selection in Chinese text categorization," Journal of Chinese Information Processing, 2004, Vol. 18, No. 1, pp. 26-32.
[10] Hailong Zhang and Lianzhi Wang, "Automatic text categorization feature selection methods research," Computer Engineering and Design, 2006, Vol. 27, No. 20, pp. 383-3841.
[11] Tao Chen and Yangqun Xie, "Literature review of feature dimension reduction in text categorization," Journal of the China Society for Scientific and Technical Information, 2005, Vol. 24, No. 6, pp. 691-695.
