


©2010 International Journal of Computer Applications (0975 - 8887)
Volume 1 – No. 19

A New Text Mining Approach Based on HMM-SVM for Web News Classification

Krishnalal G, Senior Lecturer, CSE, SJCET, Palai, Kottayam, Kerala, India
S Babu Rengarajan, Professor & Head of IT, PET Engineering College, Vallioor, Tamil Nadu, India
K G Srinivasagan, Associate Professor, CSE, National Engineering College, Kovilpatti, Tamil Nadu, India

ABSTRACT
Since the emergence of the WWW, it has become essential to handle very large amounts of electronic data, the majority of which is in the form of text. This scenario can be handled effectively by various data mining techniques. This paper proposes an intelligent system for online news classification based on the Hidden Markov Model (HMM) and the Support Vector Machine (SVM). The system is designed to extract keywords from online newspaper content and classify it into predefined categories. Three stages are designed to classify the content of online newspapers: (1) text pre-processing, (2) HMM-based feature extraction, and (3) classification using SVM. Data have been collected for experimentation from The Hindu, The New Indian Express, Times of India, Business Line, and The Economic Times. The experimental results are based on the news categories sports, finance, and politics, for which the accuracies are 92.45%, 96.34%, and 90.76% respectively. These results are very good compared with those of other text classification methods.

Keywords
Feature Extraction, HMM, kNN, POS, SVM.

1. INTRODUCTION
News articles on topical issues are helpful for company managers and other decision-makers. However, due to the sheer number of news articles published, it is a time-consuming task to select the most interesting ones. Therefore, a method of news-article categorization is essential to obtain the relevant information quickly.

In order to develop such a text-classification system, many researchers have devoted their work to automating the text classification task. One news-story categorization system was developed around a rule base generated from human expertise. Building such a system requires a huge effort from indexing experts, taking more than a couple of months because of the enormous number of rules.

On the other hand, a statistical approach based on keyword extraction from training texts is a popular method of generating a knowledge base [3]. This has the advantage that the knowledge base can be generated quickly and at little cost, although guidelines are needed on how to gather a large quantity of training texts. As for automated text categorization, our results are interesting: they show that a relatively knowledge-poor machine learning algorithm can outperform human beings in a text classification task. This suggests that automated text categorization techniques are reaching a level of performance at which they can compete with humans not only in terms of cost-effectiveness and speed, but also in terms of accuracy of classification.

The rest of the paper is organized as follows: section 2 describes related works, section 3 provides the basics of the text classification procedure, section 4 presents the proposed multi-class classifier based on HMM-SVM, and section 5 presents the experimental results, followed by concluding remarks.

2. RELATED WORKS
Before going into the new intelligent classification system, it is essential to have an overview of the various existing methodologies.

2.1 Manual and Fuzzy Text Classification
Much useful information is in the form of text: this ranges from emails, web pages, newspaper articles, and market research reports through to CVs, complaint letters from customers, and internally generated reports [18]. As far as online newspapers are concerned, the system is supposed to provide news under various categories such as national, international, regional, politics, finance, sports, and entertainment.

In the early days of online newspapers, the classification and indexing of news was done manually. Classifying and indexing news reports by hand was found to be an expensive, slow, and labor-intensive activity. Consistent accuracy was difficult to obtain with human indexers, and the work tended to cause high staff turnover. With these indexing issues in mind, Carnegie Group, based in Pittsburgh, worked with Reuters [27] to develop the Construe system, an automated news categorization system based on fuzzy rule-based text categorization.

2.2 Automated Text Categorization
In the last 15 years or so, substantial research has been conducted on text classification through supervised machine learning techniques [13], [15]. The vast majority of studies in this area focus on classification by topic, where bag-of-content-word models turn out to be very effective. Recently, there has also been increasing interest in automated categorization by overall sentiment, degree of subjectivity, authorship, and other dimensions that can be grouped together under the cover terms "genre" and "style" [4].


Genre and style classification tasks cannot be tackled using only the simple lexical cues that have proven so effective in topic detection. For instance, an objective report and a subjective editorial about the Iraq war will probably share many of the same content words; vice versa, objective reports about Iraq and soccer will share very few interesting content words. Thus, categorization by genre and style must rely on more abstract, topic-independent features. Popular choices of features [12], [14] have been function words (which are usually discarded or down-weighted in topic-based categorization), textual statistics (e.g., average sentence length, lexical richness measures), n-grams, and part-of-speech (POS) information. There are other techniques for text classification, such as k-Nearest Neighbour (kNN), but when the accuracy of classification is considered, these algorithms are not sufficient for news classification in domains such as online newspapers or online journals, where accuracy is of prime importance.

3. TEXT CLASSIFICATION PROCEDURE
The following steps are essential for the successful development of an intelligent classification system.

Goal definition: The goals of the classification system should adhere to the following requirements:
a. To be a fully automated system or a classification-supporting system.
b. A number of categories (one or several) are to be assigned to each incoming text.
c. Classification correctness.
d. Computing time needs to be limited, both when generating a knowledge base and when classifying each text.

Category definition: The categories to be classified should be defined beforehand. Ideally, the categories should be mutually exclusive in their keyword distributions.

Text collection: A large number of texts must be prepared, some of which are used for training the system, while others are used for evaluation purposes. Each text should be assigned to the relevant categories beforehand.

Statistical analysis of texts: Statistical analysis of the texts is very important, covering, for example, text length, the number of texts in each category, and the keyword distribution in each text or in all texts.

Knowledge base generation: A knowledge base is generated using training texts. Weighted keywords [30], which define the character of each category, are extracted and stored in the knowledge base.

Evaluation of classification results: One or more categories are assigned to each text. Then the classification correctness ("recall" and "precision") is calculated [7], [8].

Results analysis and knowledge base refinement: If the degree of correctness is not sufficient to achieve the system's goals, the knowledge base will require refinement.

4. PROPOSED METHODOLOGY
The proposed intelligent news classifier is designed and developed to exploit the full potential of the HMM and SVM methods. The general architecture of the proposed system is shown in Figure 1.

Fig. 1. General Architecture

4.1 Pre-Processing
News articles in the form of text files are fed into the system using the JFileChooser API. Each article then undergoes parsing, carried out using the Java StringTokenizer. Parsing is the division of text into a set of discrete parts, or tokens, which in a certain sequence can convey a semantic meaning [29]. The StringTokenizer class provides the first step in this parsing process, often called the lexer (lexical analyzer) or scanner. StringTokenizer implements the Enumeration interface; therefore, given an input string, it can enumerate the individual tokens contained in it.

To use StringTokenizer, specify an input string and, optionally, the delimiters. Delimiters are the characters that separate tokens; for example, ", ; :" sets the delimiters to a comma, semicolon, and colon. The default set of delimiters consists of the whitespace characters: space, tab, newline, and carriage return.

The words obtained from the tokenizer form the basis for the feature space of the training data. However, a fair amount of pre-processing not only prunes down the training size, but also makes the data cleaner and able to train the classifier more effectively [22].

The conversion of the entire text to lower case and the removal of non-alphanumeric content are followed by stop-word elimination and grammatical stemming.
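As an illustration, the following is a minimal Java sketch of this pre-processing stage. The stop-word set shown is a tiny illustrative sample, not the list actually used by the system, and the stemming step is only indicated by a comment.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.StringTokenizer;

    public class PreProcessor {
        // Illustrative sample only; the system's real stop-word list is larger.
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("the", "is", "doing", "have", "a", "an", "of"));

        public static List<String> preprocess(String text) {
            List<String> tokens = new ArrayList<>();
            // Default delimiters: space, tab, newline, and carriage return.
            StringTokenizer tokenizer = new StringTokenizer(text);
            while (tokenizer.hasMoreTokens()) {
                // Lower-case each token and strip non-alphanumeric characters.
                String token = tokenizer.nextToken().toLowerCase()
                                        .replaceAll("[^a-z0-9]", "");
                // Keep the token only if something is left and it is not a stop word.
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    tokens.add(token); // a stemmer, e.g. Porter's [25], would be applied here
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints: [sensex, well, analysts, said]
            System.out.println(preprocess("The Sensex is doing well, analysts have said."));
        }
    }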

Many text classification systems use a stop-word list to delete "noise" words [33] before the text goes to the classification algorithm. The system examined two types of stop words: (a) words appearing in every category (these words depend on the text domains, the number of categories, and the volume of the training texts), and (b) words that do not apparently characterize any category, examples of which are 'the', 'is', 'doing', 'have', etc., which appear very frequently in all documents but almost always carry no useful information. The main interest is in words that most strongly characterize the news as belonging to one of the classes and virtually assure that a given news article belongs to a particular category only.

The next step is grammatical stemming [25]. A stemmer is an algorithm that determines the stem form of a given word. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The purpose of stemming is to map different forms of a word with the same meaning to the same feature in the classifier's training space.

Since the knowledge base plays an important role in this type of classification system, the methods of training set selection and of generation and refinement of the knowledge base are crucial problems [20]. In this paper, the following parameters are considered for building the knowledge base.

4.1.1 Treatment of stop words
Though the necessity of stop words has been debated [5], [6], the creation of the training set also involves the removal of stop words.
4.1.2 Keyword weighting
A set of noun words, verb words, and unit symbols appearing in an article is defined here as the 'keywords' [19], [26]. The system adopts the frequency of a keyword [28], [38] as its assigned weight [30]. These weights may be normalized between categories in order to adjust for any difference in the quantity of training articles assigned to each category.
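A minimal sketch of this weighting step, under the assumption that the weight of a keyword is its raw frequency over a category's training tokens, normalized by the category's total token count so that categories of different sizes remain comparable:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class KeywordWeights {
        // tokens: the pre-processed tokens of all training articles of one category.
        public static Map<String, Double> weigh(List<String> tokens) {
            Map<String, Double> weights = new HashMap<>();
            if (tokens.isEmpty()) return weights;
            // Count raw keyword frequencies.
            for (String token : tokens) {
                weights.merge(token, 1.0, Double::sum);
            }
            // Normalize by the category's token count, so the weights remain
            // comparable across categories with different training volumes.
            double total = tokens.size();
            weights.replaceAll((word, count) -> count / total);
            return weights;
        }
    }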
with probabilities I = P(q1 = Si) , the probability of starting in
state Si and aij = P(qt+1 = Sj | qt = Si ), the probability of transition
4.1.3 Scoping for keywords in a defined area
As is well known, most keywords are found in the headline and within the first paragraph of a news article [37]. Therefore, defining which areas of an article are to be checked when extracting keywords may make the extraction more effective.

Selection of training articles plays a vital role in the overall performance of the classifier. If training articles are not selected appropriately, the classification system will be of no use even if the classification algorithm is of excellent quality. In this paper, the following two kinds of parameters are considered in selecting training articles:

a. Quantity: Generating a knowledge base from a large quantity of classified training articles takes much computing time [10]. However, the fewer the training articles, the worse the classification correctness becomes.

b. Publication-date time lapse between training and evaluation articles: The topics of news articles are continually changing, and new words are constantly appearing. We assume that the classification correctness would be better if the time elapsed between the publication dates of the training and evaluation articles were shorter. This parameter is also relevant to continually maintaining the quality of the knowledge base.

4.2 Hidden Markov Model
A method of feature selection in texts and an effective technique for classifying them are described in this part [40]. When classifying texts, the words included in them are used as classification features [21]. Markovian models are now regarded as among the most significant state-of-the-art approaches to sequence learning. Besides several applications in pattern recognition and molecular biology, HMMs have also been applied to text-related tasks, including natural language processing [7] and, more recently, information retrieval and extraction [14], [36]. The recent view of the HMM as a particular case of Bayesian networks [11] has helped their theoretical understanding and the ability to conceive extensions to the standard model in a sound and formally elegant framework. HMMs are used in this project for the feature extraction and primary classification of the given input news.

We consider that a text is a sequence of observations O = (O1, …, OT). The observations Ot correspond to the tokens of the text. Technically, each token is a vector of attributes generated by a collection of NLP tools. We should attach a semantic tag Xi to some of the tokens Ot. An extraction algorithm maps an observation sequence O1, …, OT to a sequence of tags drawn from the tag set X1, …, Xn.

An HMM λ = (π, A, B) consists of finitely many states {S1, …, Sn} with initial probabilities πi = P(q1 = Si), the probability of starting in state Si, and transition probabilities aij = P(qt+1 = Sj | qt = Si), the probability of moving from state Si to state Sj. Each state is characterized by a probability distribution bi(Ot) = P(Ot | qt = Si) over observations. Given an observation sequence O = (O1, …, OT) and following Bayes' principle, for each observation Ot we have to return the tag Xi which maximizes the probability P(Xt = Xi | O); this means that we should identify the states that maximize P(qt = Si | O, λ) and return, for each token Ot, the tag Xi that corresponds to the state Si.
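As a concrete illustration (an implementation choice of this sketch, not something the paper prescribes), the definition λ = (π, A, B) maps directly onto a small Java container, with observations discretized to indices into a finite symbol vocabulary:

    public class Hmm {
        final double[] pi;   // pi[i]   = P(q1 = Si), probability of starting in state Si
        final double[][] a;  // a[i][j] = P(q(t+1) = Sj | qt = Si), transition probabilities
        final double[][] b;  // b[i][o] = P(Ot = o | qt = Si), emission probabilities over
                             //           a finite vocabulary of observation symbols

        Hmm(double[] pi, double[][] a, double[][] b) {
            this.pi = pi;
            this.a = a;
            this.b = b;
        }

        int numStates() {
            return pi.length;
        }
    }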


To compute these quantities, the forward-backward algorithm of the HMM is used. The algorithm comprises three steps:

1. Computing forward probabilities
2. Computing backward probabilities
3. Computing smoothed values

αt(i) = P(qt = Si, O1, …, Ot | λ) is the forward variable. It quantifies the probability of reaching state Si at time t while observing the initial part O1, …, Ot of the observation sequence. βt(i) = P(Ot+1, …, OT | qt = Si, λ) is the backward variable; it quantifies the chance of observing the remaining sequence Ot+1, …, OT when in state Si at time t. Once αt(i) and βt(i) have been computed, the probability of being in state Si at time t, given the observation sequence O, can be expressed as

γt(i) = αt(i) βt(i) / P(O | λ).
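A sketch of these three steps for such a discrete HMM follows; it implements the standard recursions and, for brevity, omits the rescaling of αt(i) and βt(i) that a production implementation would need to avoid numerical underflow on long texts.

    public class ForwardBackward {
        // Returns gamma[t][i] = P(qt = Si | O, lambda) for observation indices obs.
        public static double[][] gamma(double[] pi, double[][] a, double[][] b, int[] obs) {
            int n = pi.length, T = obs.length;
            double[][] alpha = new double[T][n];
            double[][] beta = new double[T][n];

            // Step 1 - forward pass: alpha[t][i] = P(qt = Si, O1..Ot | lambda).
            for (int i = 0; i < n; i++) alpha[0][i] = pi[i] * b[i][obs[0]];
            for (int t = 1; t < T; t++)
                for (int j = 0; j < n; j++) {
                    double s = 0;
                    for (int i = 0; i < n; i++) s += alpha[t - 1][i] * a[i][j];
                    alpha[t][j] = s * b[j][obs[t]];
                }

            // Step 2 - backward pass: beta[t][i] = P(O(t+1)..OT | qt = Si, lambda).
            for (int i = 0; i < n; i++) beta[T - 1][i] = 1.0;
            for (int t = T - 2; t >= 0; t--)
                for (int i = 0; i < n; i++) {
                    double s = 0;
                    for (int j = 0; j < n; j++) s += a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j];
                    beta[t][i] = s;
                }

            // Step 3 - smoothing: gamma[t][i] = alpha[t][i] * beta[t][i] / P(O | lambda),
            // where P(O | lambda) is the sum of the final forward variables.
            double pO = 0;
            for (int i = 0; i < n; i++) pO += alpha[T - 1][i];
            double[][] gamma = new double[T][n];
            for (int t = 0; t < T; t++)
                for (int i = 0; i < n; i++) gamma[t][i] = alpha[t][i] * beta[t][i] / pO;
            return gamma;
        }
    }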
The extraction using the HMM can be described as follows.

    Input: text T = (w1, …, wT); HMM λ; set of tags X1, …, Xn corresponding to the target HMM states S1, …, Sn.
    Generate the sequence O = (O1, …, OT), where Ot is the attribute vector containing word wt.
    Call the forward-backward algorithm and calculate γt(i).
    If γt(i) is sufficiently high for a target state Si, output "<Xi>wt</Xi>"; else output wt.

After the feature extraction process, the output can be normalized into a new feature vector, and then the trained SVM classifier is ready to be used for classifying a new text.

Consider K classes l = {l1, …, lK} with their respective HMM sets λ′ = {λ1, …, λK}, where λi = {λi1, …, λiNi}, so that the total number of HMMs is N = N1 + … + NK.

Compute the probability P(O | λ) for every model in a group λi = {λi1, …, λiNi}, and suppose

Pmax^li = max P(O | λij), 1 ≤ j ≤ Ni,

where the label li corresponds to the maximum probability in the group. With the HMM λij we also obtain the feature vector g = {w1, …, wt, …, wm}. Then li and g are combined into a new feature vector g̃ = {li, g}. The new feature vector is normalized as g̃ ← g̃ / ||g̃||2, and this normalization plays an important role as pre-processing for the SVM classification.
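The scoring and normalization just described can be sketched as follows, assuming the likelihoods P(O | λij) for the text have already been computed (e.g., as the sum of the final forward variables from the sketch above):

    public class HmmFeatureBuilder {
        // scores[i][j] = P(O | lambda_ij); returns the index of the label li
        // whose group contains the highest-likelihood model.
        public static int bestLabel(double[][] scores) {
            int best = 0;
            double bestP = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < scores.length; i++)
                for (double p : scores[i])
                    if (p > bestP) { bestP = p; best = i; }
            return best;
        }

        // L2-normalizes the feature vector before it is handed to the SVM.
        public static double[] normalize(double[] g) {
            double norm = 0;
            for (double v : g) norm += v * v;
            norm = Math.sqrt(norm);
            double[] out = new double[g.length];
            for (int i = 0; i < g.length; i++) out[i] = (norm == 0) ? 0 : g[i] / norm;
            return out;
        }
    }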
The above section gave a clear idea of how the HMM works as a feature extractor. Next comes the learning of the parameters λ of an HMM from a set O = {O^1, O^2, …, O^K} of example sequences [41]. The Baum-Welch algorithm solves this problem, estimating the parameters with the following formulae:

aij = [ Σ(k=1..K) Σ(t=1..Tk−1) ξt^k(i, j) ] / [ Σ(k=1..K) Σ(t=1..Tk−1) γt^k(i) ]   (1)

bj(vn) = [ Σ(k=1..K) Σ(t: Ot^k = vn) γt^k(j) ] / [ Σ(k=1..K) Σ(t=1..Tk) γt^k(j) ]   (2)

where ξt^k(i, j) and γt^k(i) are the joint-event and state variables associated with the k-th observation sequence, respectively [34]. The feature extraction using the HMM is depicted in Figure 2.

Figure 2. HMM Based Feature Extraction

4.3 Support Vector Machine
The Support Vector Machine (SVM) is a classification technique that was first applied to text categorization by Joachims [1], [9]. It is a powerful supervised learning paradigm based on the structural risk minimization principle. During training, this algorithm constructs a hyperplane that maximally separates the positive and negative instances in the training set. Classification of new instances is then performed by determining which side of the hyperplane they fall on [16].

Most of the previous studies that apply SVM to text categorization use all the words in the document collection without any attempt to identify the important keywords [17]. On the other hand, there are various remarkable studies on keyword selection for text categorization in the literature [23]. As stated above, these studies mainly focus on keyword selection metrics and employ either a corpus-based or a class-based keyword selection approach; they do not use standard datasets and mostly lack a time-complexity analysis of the proposed methods [32], [39]. In addition, most of these studies do not use SVM as the classification algorithm.


For instance, Yang and Liu [12] and Yang and Pedersen [21] use kNN, and Mladenic and Grobelnik [22] use Naive Bayes [31] in their studies on keyword selection metrics. Later studies reveal that SVM performs consistently better than these classification algorithms.

Assume a group of examples {(x1, y1), (x2, y2), …, (xk, yk)}, where xk ∈ R^n and yk ∈ {−1, +1}. We consider decision functions of the form sgn((w · x) + b), where (w · x) denotes the inner product of w and x. A decision function f(w, b) should therefore be found with the property

yi((w · xi) + b) ≥ 1, i = 1, …, k.   (3)

But in many cases the separating hyperplane does not exist. To allow for the possibility of violating Equation (3), slack variables

ξi ≥ 0, i = 1, …, k   (4)

are applied, subject to yi((w · xi) + b) ≥ 1 − ξi, i = 1, …, k.

The resulting problem — minimizing (1/2)||w||² + C Σ(i=1..k) ξi under these constraints — is a constrained quadratic programming (QP) problem, which can be formulated as the following convex QP problem:

Maximize Σ(i=1..k) αi − (1/2) Σ(i=1..k) Σ(j=1..k) αi αj yi yj (xi · xj)   (5)

subject to Σ(i=1..k) yi αi = 0 and 0 ≤ αi ≤ C (i = 1, …, k),

where the αi are Lagrange multipliers and C is a parameter that assigns a penalty cost to the misclassification of samples. Solving the above QP problem gives rise to a decision function of the form

f(x) = sgn( Σ(i=1..k) yi αi (xi · x) + b ),   (6)

where b is a bias term. Only a small fraction of the coefficients αi are nonzero; the corresponding input vectors are known as support vectors, and they fully define the decision function [5], [6]. Thus, the above decision function is expressed through inner products of the data. This observation leads to the generalization to the nonlinear case, which is achieved by mapping the problem data to a higher-dimensional feature space H through a transformation of the form xi · xj → Φ(xi) · Φ(xj); the mapping is implicitly defined through a symmetric positive definite kernel function K(xi, xj) = Φ(xi) · Φ(xj). The decision function can then be rewritten as

f(x) = sgn( Σ(i=1..k) yi αi K(xi, x) + b ).   (7)

A training algorithm for the multi-class SVM can be described as the task of constructing a number of binary SVMs: one classifier Cij for every pair of distinct classes i and j [35]. Each classifier Cij is trained with the samples of the i-th class carrying positive labels and the samples of the j-th class carrying negative labels. The decision function used in this classification is

fij(g̃) = sgn( Σn yn^(li lj) αn^(li lj) (g̃n^(li lj) · g̃) + b^(li lj) ),  i ≠ j; i, j = 1, …, M,   (8)

where n runs over the training samples of the i-th and j-th classes. The SVM method described above is thus a one-against-one method [2]. Given an unknown sample, if the decision function predicts that the sample belongs to class i, the classifier Cij casts one vote for that class; otherwise the vote goes to class j. When the votes from all the binary classifiers have been collected, the unknown sample is assigned to the class with the highest number of votes.
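A sketch of this one-against-one voting, assuming the pairwise decision values fij of Equation (8) have already been evaluated for the unknown sample:

    public class OneVsOneVoter {
        // decision[i][j] holds fij for i < j: a positive value is a vote for
        // class i, a non-positive value a vote for class j.
        public static int classify(double[][] decision, int numClasses) {
            int[] votes = new int[numClasses];
            for (int i = 0; i < numClasses; i++)
                for (int j = i + 1; j < numClasses; j++)
                    votes[decision[i][j] > 0 ? i : j]++;
            // The sample is assigned to the class with the highest vote count.
            int best = 0;
            for (int c = 1; c < numClasses; c++)
                if (votes[c] > votes[best]) best = c;
            return best;
        }
    }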
5. EXPERIMENTAL RESULTS
To evaluate the effectiveness of the new classification method proposed in this paper, we chose a text set consisting of 1200 news items taken from The Hindu [42], The New Indian Express [43], Business Line [44], The Economic Times [45], and Times of India [46], distributed among three major categories: sports, finance, and politics. In our experiment, 800 texts are used as the training set and the remaining 400 texts as the testing set. The distribution of the test news from the various news sites is given in Table 1.

Table 1. Test news distribution among various classes

Class      The Hindu   The New Indian Express   Business Line   The Economic Times   Times of India   Total
Sports         58                30                    0                  0                 18          106
Finance        30                22                   40                 45                 27          164
Politics       48                31                    5                  5                 41          130
Total         136                83                   45                 50                 86          400
HMMs are trained to extract features from the texts of every class, and multi-class SVMs are trained to find the separating decision hyperplane that maximizes the margin between the classified categories. Out of the 106 sports news inputs, 98 were classified correctly. Among the 164 finance news inputs, 158 were classified correctly, and out of the 130 politics test news items, 118 were classified as politics news.

Upon analyzing the misclassified news, we found some ambiguity in the text features of the news inputs. For instance, a report on the terror attack on Sri Lankan cricket players at Lahore contains features that can mislead our classifier into labelling the news as sports news, even though it belongs to another (international) category. The classification accuracies of the proposed system for the categories sports, finance, and politics are 92.45%, 96.34%, and 90.76% respectively. We compared the classification accuracy of this method with that of kNN and SVM, and the test results are shown in Table 2.
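As a quick check, these percentages follow directly from the counts above: sports 98/106 ≈ 92.45%, finance 158/164 ≈ 96.34%, and politics 118/130 ≈ 90.76%.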

Table 2. News classification accuracy of three methods

Category    kNN (%)   SVM (%)   HMM-SVM (%)
Sports        83.25     87.67       92.45
Finance       80.22     82.57       96.34
Politics      82.26     86.55       90.76

The proposed method's classification accuracy is plotted against that of the kNN and SVM classifiers in Figure 3.

[Figure 3: bar chart "News classification of various methods" comparing the accuracy (%) of kNN, SVM, and HMM-SVM on the sports, finance, and politics categories.]
Figure 3. Graphical representation of the classification accuracy of the various methods

6. CONCLUSION
The intelligent news classifier was developed and tested with online news from the web for the categories sports, finance, and politics. The novel approach of combining two powerful algorithms, the Hidden Markov Model and the Support Vector Machine, in the online news classification domain provides extremely good results compared with existing methodologies. By introducing several preprocessing techniques and applying filters, we reduced the noise to a great extent, which in turn improved the classification accuracy. Preprocessing of the training data set also significantly reduced the training time. The experimental results show the performance of this new approach compared with the existing techniques.

7. REFERENCES
[1] T. Joachims, Learning to Classify Text Using Support Vector Machines, Kluwer, 2002.
[2] R. Yan, Y. Liu, A. Hauptmann, "On Predicting Rare Classes with SVM Ensembles in Scene Classification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 03), April 6-10, 2003.
[3] D. Tao, X. Tang, X. Li, and X. Wu, "Asymmetric Bagging and Random Subspacing for Support Vector Machines-based Relevance Feedback in Image Retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[4] Lei Tang, Huan Liu, "Bias Analysis in Text Classification for Highly Skewed Data," Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), 2005, pp. 781-784.
[5] J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic, "Training Text Classifiers with SVM on Very Few Positive Examples," Microsoft Research Technical Report MSR-TR-2003-34, 2003.
[6] D. Lin and P. Pantel, "Discovery of Inference Rules for Question Answering," Natural Language Engineering 7(4), 2001, pp. 343-360.
[7] D. Lin and P. Pantel, "Induction of Semantic Classes from Natural Language Text," Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2001, pp. 317-322.
[8] E. Voorhees, "Query Expansion Using Lexical-Semantic Relations," Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 61-69.
[9] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," European Conference on Machine Learning (ECML), 1998.
[10] L. Ozgur, T. Gungor, F. Gurgen, "Adaptive Anti-Spam Filtering for Agglutinative Languages: A Special Case for Turkish," Pattern Recognition Letters 25(16), 2004, pp. 1819-1831.
[11] A. McCallum, K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," in M. Sahami (Ed.), Proc. of the AAAI Workshop on Learning for Text Categorization, Madison, WI, 1998, pp. 41-48.
[12] Y. Yang, X. Liu, "A Re-examination of Text Categorization Methods," Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, US, 1999.
[13] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys 34(1), 2002, pp. 1-47.
[14] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research 3, 2003, pp. 1289-1305.
[15] A. Ozgur, "Supervised and Unsupervised Machine Learning Techniques for Text Document Categorization," Master's Thesis, Bogazici University, Turkey, 2004.
[16] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery 2(2), 1998, pp. 121-167.

[17] T. Joachims, "Making Large-Scale SVM Learning Practical," in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
[18] S.-H. Lin, C.-S. Shih, M. C. Chen, J.-M. Ho, "Extracting Classification Knowledge of Internet Documents with Mining Term Associations: A Semantic Approach," Proc. of ACM SIGIR, Melbourne, Australia, 1998, pp. 241-249.
[19] A. P. Azcarraga, T. Yap, T. S. Chua, "Comparing Keyword Extraction Techniques for WEBSOM Text Archives," International Journal of Artificial Intelligence Tools 11(2), 2002.
[20] A. Aizawa, "Linguistic Techniques to Improve the Performance of Automatic Text Categorization," Proceedings of the 6th Natural Language Processing Pacific Rim Symposium, Tokyo, JP, 2001, pp. 307-314.
[21] Y. Yang, J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 412-420.
[22] D. Mladenic, M. Grobelnik, "Feature Selection for Unbalanced Class Distribution and Naive Bayes," Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 258-267.
[23] G. Salton, C. Yang, A. Wong, "A Vector-Space Model for Automatic Indexing," Communications of the ACM 18(11), 1975, pp. 613-620.
[24] ftp://ftp.cs.cornell.edu/pub/smart/ (2004).
[25] M. F. Porter, "An Algorithm for Suffix Stripping," Program 14(3), 1980, pp. 130-137.
[26] G. Salton, C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval," Information Processing and Management 24(5), 1988, pp. 513-523.
[27] D. D. Lewis, Reuters-21578 Document Corpus V1.0, http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
[28] P. Kroha, R. Baeza-Yates, "A Case Study: News Classification Based on Term Frequency," Proceedings of the Sixteenth International Workshop on Database and Expert Systems Applications, 2005.
[29] Lin Lv, Yu-Shu Liu, "Research of English Text Classification Methods Based on Semantic Meaning," ITI 3rd International Conference on Information and Communications Technology: Enabling Technologies for the New Knowledge Society, 2005.
[30] Md. Rafiqul Islam, Md. Rakibul Islam, "An Effective Term Weighting Method Using Random Walk Model for Text Classification," 11th International Conference on Computer and Information Technology (ICCIT 2008), 24-27 Dec. 2008.
[31] Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, Sung Hyon Myaeng, "Some Effective Techniques for Naive Bayes Text Classification," IEEE Transactions on Knowledge and Data Engineering 18(11), Nov. 2006.
[32] Miao Zhang, De-xian Zhang, "Trained SVMs Based Rules Extraction Method for Text Classification," IEEE International Symposium on IT in Medicine and Education (ITME 2008), 12-14 Dec. 2008.
[33] S. Agarwal, S. Godbole, D. Punjani, Shourya Roy, "How Much Noise Is Too Much: A Study in Automatic Text Classification," Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007.
[34] M. Makrehchi, M. S. Kamel, "Combining Feature Ranking for Text Classification," IEEE International Conference on Systems, Man and Cybernetics (ISIC), 7-10 Oct. 2007.
[35] Guifa Teng, Yihong Liu, Jianbin Ma, Fang Wang, Huiting Yao, "Improved Algorithm for Text Classification Based on TSVM," First International Conference on Innovative Computing, Information and Control (ICICIC '06), Vol. 2, Aug. 30 - Sept. 1, 2006.
[36] Hui He, Bo Chen, Jun Guo, "Semi-supervised Chinese Compound Word Extraction Based on HMM," 7th World Congress on Intelligent Control and Automation (WCICA 2008), 25-27 June 2008.
[37] Wei Hu, Dong-Mo Zhang, Huan-Ye Sheng, "Vague Events-based Chinese Web News Classification," Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, 2004.
[38] P. Kroha, R. Baeza-Yates, "A Case Study: News Classification Based on Term Frequency," Proceedings of the Sixteenth International Workshop on Database and Expert Systems Applications, 26 Aug. 2005.
[39] Proceedings of the ACM First Ph.D. Workshop in Information and Knowledge Management, Lisbon, Portugal, 2007.
[40] Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Qin-Bao Song, "A New Text Feature Extraction Model and Its Application in Document Copy Detection," International Conference on Machine Learning and Cybernetics, 2003.
[41] "Feature Selection and Feature Extraction for Text Categorization," Proceedings of the Workshop on Speech and Natural Language, Harriman, New York, Human Language Technology Conference archive.
[42] www.hinduonnet.com
[43] www.expressbuzz.com
[44] www.thehindubusinessline.com
[45] http://economictimes.indiatimes.com
[46] http://timesofindia.indiatimes.com