


©2010 International Journal of Computer Applications (0975 - 8887)
Volume 1 – No. 19

A New Text Mining Approach Based on HMM-SVM for Web News Classification

Krishnalal G, Senior Lecturer, CSE, SJCET, Palai, Kottayam, Kerala, India
S Babu Rengarajan, Professor & Head of IT, PET Engineering College, Vallioor, Tamil Nadu, India
K G Srinivasagan, Associate Professor, CSE, National Engineering College, Kovilpatti, Tamil Nadu, India

ABSTRACT
Since the emergence of the WWW, it has become essential to handle very large amounts of electronic data, the majority of which is in the form of text. This scenario can be handled effectively by various data mining techniques. This paper proposes an intelligent system for online news classification based on the Hidden Markov Model (HMM) and the Support Vector Machine (SVM). The system is designed to extract keywords from online newspaper content and classify it into predefined categories. Three stages are designed to classify the content of online newspapers: (1) text pre-processing, (2) HMM-based feature extraction, and (3) classification using SVM. Data have been collected for experimentation from The Hindu, The New Indian Express, Times of India, Business Line, and The Economic Times. The experimental results are based on the news categories sports, finance, and politics, for which the accuracies are 92.45%, 96.34%, and 90.76% respectively. These results are very good compared with those of other text classification methods.

Keywords
Feature Extraction, HMM, kNN, POS, SVM.

1. INTRODUCTION
News articles on topical issues are helpful for company managers and other decision-makers. However, due to the sheer number of news articles published, it is a time-consuming task to select the most interesting ones. Therefore, a method of news-article categorization is essential to obtain the relevant information quickly.

In order to develop such a text-classification system, many researchers have devoted their work to automating the text classification task. One news-story categorization system was developed around a rule base generated from human expertise. Building such a system requires a huge effort from indexing experts, taking more than a couple of months because of the enormous number of rules.

On the other hand, a statistical approach based on keyword extraction from training texts is a popular method of generating a knowledge base [3]. This has the advantage that the knowledge base can be generated quickly and at little cost, although guidelines are needed on how to gather a large quantity of training texts. As for automated text categorization, our results are interesting: they show that a relatively knowledge-poor machine learning algorithm can outperform human beings in a text classification task. This suggests that automated text categorization techniques are reaching a level of performance at which they can compete with humans not only in terms of cost-effectiveness and speed, but also in terms of accuracy of classification.

The rest of the paper is organized as follows: section 2 describes related works, section 3 provides the basics of the text classification procedure, section 4 presents the proposed multi-class classifier based on HMM-SVM, and section 5 presents the experimental results, followed by concluding remarks.

2. RELATED WORKS
Before going into the new intelligent classification system, it is essential to have an overview of the various existing methodologies.

2.1 Manual and Fuzzy Text Classification
Much useful information is in the form of text: this ranges from emails, web pages, newspaper articles, and market research reports through to CVs, complaint letters from customers, and internally generated reports [18]. As far as online newspapers are concerned, the system is supposed to provide news under various categories such as national, international, regional, politics, finance, sports, and entertainment.

In the early days of online newspapers, the classification and indexing of news was done manually. Classifying and indexing news reports by hand was found to be an expensive, slow, and labor-intensive activity. Consistent accuracy was difficult to obtain with human indexers, and the work tended to cause high staff turnover. With these indexing issues in mind, Carnegie Group, based in Pittsburgh, worked with Reuters [27] to develop the Construe system, an automated news categorization system based on fuzzy rule-based text categorization.

2.2 Automated Text Categorization
In the last 15 years or so, substantial research has been conducted on text classification through supervised machine learning techniques [13], [15]. The vast majority of studies in this area focus on classification by topic, where bag-of-content-word models turn out to be very effective. Recently, there has also been increasing interest in automated categorization by overall sentiment, degree of subjectivity, authorship, and other dimensions that can be grouped together under the cover terms "genre" and "style" [4].


Genre and style classification tasks cannot be tackled using only the simple lexical cues that have proven so effective in topic detection. For instance, an objective report and a subjective editorial about the Iraq war will probably share many of the same content words; vice versa, objective reports about Iraq and soccer will share very few interesting content words. Thus, categorization by genre and style must rely on more abstract, topic-independent features. Popular choices of features [12], [14] have been function words (which are usually discarded or down-weighted in topic-based categorization), textual statistics (e.g., average sentence length, lexical richness measures), n-grams, and part-of-speech (POS) information. There are other techniques for text classification, such as k-Nearest Neighbour (kNN), but when the accuracy of classification is considered, these algorithms are not sufficient for news classification in domains such as online newspapers or online journals, where accuracy is of prime importance.

3. TEXT CLASSIFICATION PROCEDURE
The following steps are essential for the successful development of an intelligent classification system.

Goal definition: The goals of the classification system should adhere to the following requirements:
a. To be a fully automated system or a classification-supporting system.
b. A number of categories (one or several) are to be assigned to each incoming text.
c. Classification correctness.
d. Computing time needs to be limited, both when generating a knowledge base and when classifying each text.

Category definition: The categories to be classified should be defined beforehand. Ideally, the categories should be mutually exclusive in their keyword distributions.

Text collection: A large number of texts must be prepared, some of which are used for training the system, while others are used for evaluation purposes. Each text should be assigned to the relevant categories beforehand.

Statistical analysis of texts: Statistical analysis of the texts is very important, covering, for example, text length, the number of texts in each category, and the keyword distribution in each text or in all texts.

Knowledge base generation: A knowledge base is generated using training texts. Weighted keywords [30], which define the character of each category, are extracted and stored in the knowledge base.

Evaluation of classification results: One or more categories are assigned to each text. Then the classification correctness ("recall" and "precision") is calculated [7], [8].

Results analysis and knowledge base refinement: If the degree of correctness is not sufficient to achieve the system's goals, the knowledge base will require refinement.

4. PROPOSED METHODOLOGY
The proposed intelligent news classifier is designed and developed to exploit the full potential of the HMM and SVM methods. The general architecture of the proposed system is shown in Figure 1.

Fig. 1. General Architecture

4.1 Pre-Processing
News articles in the form of text files are fed into the system using the JFileChooser API. Each article then undergoes parsing, carried out using the Java StringTokenizer. Parsing is the division of text into a set of discrete parts, or tokens, which in a certain sequence can convey a semantic meaning [29]. The StringTokenizer class provides the first step in this parsing process, often called the lexer (lexical analyzer) or scanner. StringTokenizer implements the Enumeration interface; therefore, given an input string, it can enumerate the individual tokens contained in it.

To use StringTokenizer, specify an input string and, optionally, the delimiters. Delimiters are the characters that separate tokens; for example, ", ; :" sets the delimiters to a comma, semicolon, and colon. The default set of delimiters consists of the whitespace characters: space, tab, newline, and carriage return.

The words obtained from the tokenizer form the basis for the feature space of the training data. However, a fair amount of pre-processing not only prunes down the training size, but also makes the data cleaner and able to train the classifier more effectively [22].

The conversion of the entire text to lower case and the removal of non-alphanumeric content are followed by stop-word elimination and grammatical stemming.
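As an illustration, the following is a minimal Java sketch of this pre-processing stage. The stop-word set shown is a tiny illustrative sample, not the list actually used by the system, and the stemming step is only indicated by a comment.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.StringTokenizer;

    public class PreProcessor {
        // Illustrative sample only; the system's real stop-word list is larger.
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("the", "is", "doing", "have", "a", "an", "of"));

        public static List<String> preprocess(String text) {
            List<String> tokens = new ArrayList<>();
            // Default delimiters: space, tab, newline, and carriage return.
            StringTokenizer tokenizer = new StringTokenizer(text);
            while (tokenizer.hasMoreTokens()) {
                // Lower-case each token and strip non-alphanumeric characters.
                String token = tokenizer.nextToken().toLowerCase()
                                        .replaceAll("[^a-z0-9]", "");
                // Keep the token only if something is left and it is not a stop word.
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    tokens.add(token); // a stemmer, e.g. Porter's [25], would be applied here
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints: [sensex, well, analysts, said]
            System.out.println(preprocess("The Sensex is doing well, analysts have said."));
        }
    }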

Many text classification systems use a stop-word list to delete "noise" words [33] before the text goes to the classification algorithm. The system examined two types of stop words: (a) words appearing in every category (these words depend on the text domains, the number of categories, and the volume of the training texts), and (b) words that do not apparently characterize any category, examples of which are 'the', 'is', 'doing', 'have', etc., which appear very frequently in all documents but almost always carry no useful information. The main interest is in words that most strongly characterize the news as belonging to one of the classes and virtually assure that a given news article belongs to a particular category only.

The next step is grammatical stemming [25]. A stemmer is an algorithm that determines the stem form of a given word. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The purpose of stemming is to map different forms of a word with the same meaning to the same feature in the classifier's training space.

Since the knowledge base plays an important role in this type of classification system, the methods of training set selection and of generation and refinement of the knowledge base are crucial problems [20]. In this paper, the following parameters are considered for building the knowledge base.

4.1.1 Treatment of stop words
Though the necessity of stop words has been debated [5], [6], the creation of the training set also involves the removal of stop words.
4.1.2 Keyword weighting
A set of noun words, verb words, and unit symbols appearing in an article is defined here as the 'keywords' [19], [26]. The system adopts the frequency of a keyword [28], [38] as its assigned weight [30]. These weights may be normalized between categories in order to adjust for any difference in the quantity of training articles assigned to each category.
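A minimal sketch of this weighting step, under the assumption that the weight of a keyword is its raw frequency over a category's training tokens, normalized by the category's total token count so that categories of different sizes remain comparable:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class KeywordWeights {
        // tokens: the pre-processed tokens of all training articles of one category.
        public static Map<String, Double> weigh(List<String> tokens) {
            Map<String, Double> weights = new HashMap<>();
            if (tokens.isEmpty()) return weights;
            // Count raw keyword frequencies.
            for (String token : tokens) {
                weights.merge(token, 1.0, Double::sum);
            }
            // Normalize by the category's token count, so the weights remain
            // comparable across categories with different training volumes.
            double total = tokens.size();
            weights.replaceAll((word, count) -> count / total);
            return weights;
        }
    }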
with probabilities I = P(q1 = Si) , the probability of starting in
state Si and aij = P(qt+1 = Sj | qt = Si ), the probability of transition
4.1.3 Scoping for keywords in a defined area
As is well known, most keywords are found in the headline and within the first paragraph of a news article [37]. Therefore, defining which areas of an article are to be checked when extracting keywords may make the extraction more effective.

Selection of training articles plays a vital role in the overall performance of the classifier. If training articles are not selected appropriately, the classification system will be of no use even if the classification algorithm is of excellent quality. In this paper, the following two kinds of parameters are considered in selecting training articles:

a. Quantity: Generating a knowledge base from a large quantity of classified training articles takes much computing time [10]. However, the fewer the training articles, the worse the classification correctness becomes.

b. Publication-date time lapse between training and evaluation articles: The topics of news articles are continually changing, and new words are constantly appearing. We assume that the classification correctness would be better if the time elapsed between the publication dates of the training and evaluation articles were shorter. This parameter is also relevant to continually maintaining the quality of the knowledge base.

4.2 Hidden Markov Model
A method of feature selection in texts and an effective technique for classifying them are described in this part [40]. When classifying texts, the words included in them are used as classification features [21]. Markovian models are now regarded as among the most significant state-of-the-art approaches to sequence learning. Besides several applications in pattern recognition and molecular biology, HMMs have also been applied to text-related tasks, including natural language processing [7] and, more recently, information retrieval and extraction [14], [36]. The recent view of the HMM as a particular case of Bayesian networks [11] has helped their theoretical understanding and the ability to conceive extensions to the standard model in a sound and formally elegant framework. HMMs are used in this project for the feature extraction and primary classification of the given input news.

We consider that a text is a sequence of observations O = (O1, …, OT). The observations Ot correspond to the tokens of the text. Technically, each token is a vector of attributes generated by a collection of NLP tools. We should attach a semantic tag Xi to some of the tokens Ot. An extraction algorithm maps an observation sequence O1, …, OT to a sequence of tags drawn from the tag set X1, …, Xn.

An HMM λ = (π, A, B) consists of finitely many states {S1, …, Sn} with initial probabilities πi = P(q1 = Si), the probability of starting in state Si, and transition probabilities aij = P(qt+1 = Sj | qt = Si), the probability of moving from state Si to state Sj. Each state is characterized by a probability distribution bi(Ot) = P(Ot | qt = Si) over observations. Given an observation sequence O = (O1, …, OT) and following Bayes' principle, for each observation Ot we have to return the tag Xi which maximizes the probability P(Xt = Xi | O); this means that we should identify the states that maximize P(qt = Si | O, λ) and return, for each token Ot, the tag Xi that corresponds to the state Si.
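As a concrete illustration (an implementation choice of this sketch, not something the paper prescribes), the definition λ = (π, A, B) maps directly onto a small Java container, with observations discretized to indices into a finite symbol vocabulary:

    public class Hmm {
        final double[] pi;   // pi[i]   = P(q1 = Si), probability of starting in state Si
        final double[][] a;  // a[i][j] = P(q(t+1) = Sj | qt = Si), transition probabilities
        final double[][] b;  // b[i][o] = P(Ot = o | qt = Si), emission probabilities over
                             //           a finite vocabulary of observation symbols

        Hmm(double[] pi, double[][] a, double[][] b) {
            this.pi = pi;
            this.a = a;
            this.b = b;
        }

        int numStates() {
            return pi.length;
        }
    }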


To compute these quantities, the forward-backward algorithm of the HMM is used. The algorithm comprises three steps:

1. Computing forward probabilities
2. Computing backward probabilities
3. Computing smoothed values

αt(i) = P(qt = Si, O1, …, Ot | λ) is the forward variable. It quantifies the probability of reaching state Si at time t while observing the initial part O1, …, Ot of the observation sequence. βt(i) = P(Ot+1, …, OT | qt = Si, λ) is the backward variable; it quantifies the chance of observing the remaining sequence Ot+1, …, OT when in state Si at time t. Once αt(i) and βt(i) have been computed, the probability of being in state Si at time t, given the observation sequence O, can be expressed as

γt(i) = αt(i) βt(i) / P(O | λ).
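A sketch of these three steps for such a discrete HMM follows; it implements the standard recursions and, for brevity, omits the rescaling of αt(i) and βt(i) that a production implementation would need to avoid numerical underflow on long texts.

    public class ForwardBackward {
        // Returns gamma[t][i] = P(qt = Si | O, lambda) for observation indices obs.
        public static double[][] gamma(double[] pi, double[][] a, double[][] b, int[] obs) {
            int n = pi.length, T = obs.length;
            double[][] alpha = new double[T][n];
            double[][] beta = new double[T][n];

            // Step 1 - forward pass: alpha[t][i] = P(qt = Si, O1..Ot | lambda).
            for (int i = 0; i < n; i++) alpha[0][i] = pi[i] * b[i][obs[0]];
            for (int t = 1; t < T; t++)
                for (int j = 0; j < n; j++) {
                    double s = 0;
                    for (int i = 0; i < n; i++) s += alpha[t - 1][i] * a[i][j];
                    alpha[t][j] = s * b[j][obs[t]];
                }

            // Step 2 - backward pass: beta[t][i] = P(O(t+1)..OT | qt = Si, lambda).
            for (int i = 0; i < n; i++) beta[T - 1][i] = 1.0;
            for (int t = T - 2; t >= 0; t--)
                for (int i = 0; i < n; i++) {
                    double s = 0;
                    for (int j = 0; j < n; j++) s += a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j];
                    beta[t][i] = s;
                }

            // Step 3 - smoothing: gamma[t][i] = alpha[t][i] * beta[t][i] / P(O | lambda),
            // where P(O | lambda) is the sum of the final forward variables.
            double pO = 0;
            for (int i = 0; i < n; i++) pO += alpha[T - 1][i];
            double[][] gamma = new double[T][n];
            for (int t = 0; t < T; t++)
                for (int i = 0; i < n; i++) gamma[t][i] = alpha[t][i] * beta[t][i] / pO;
            return gamma;
        }
    }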
The extraction using the HMM can be described as follows.

    Input: text T = (w1, …, wT); HMM λ; set of tags X1, …, Xn corresponding to the target HMM states S1, …, Sn.
    Generate the sequence O = (O1, …, OT), where Ot is the attribute vector containing word wt.
    Call the forward-backward algorithm and calculate γt(i).
    If γt(i) is sufficiently high for a target state Si, output "<Xi>wt</Xi>"; else output wt.

After the feature extraction process, the output can be normalized into a new feature vector, and then the trained SVM classifier is ready to be used for classifying a new text.

Consider K classes l = {l1, …, lK} with their respective HMM sets λ′ = {λ1, …, λK}, where λi = {λi1, …, λiNi}, so that the total number of HMMs is N = N1 + … + NK.

Compute the probability P(O | λ) for every model in a group λi = {λi1, …, λiNi}, and suppose

Pmax^li = max P(O | λij), 1 ≤ j ≤ Ni,

where the label li corresponds to the maximum probability in the group. With the HMM λij we also obtain the feature vector g = {w1, …, wt, …, wm}. Then li and g are combined into a new feature vector g̃ = {li, g}. The new feature vector is normalized as g̃ ← g̃ / ||g̃||2, and this normalization plays an important role as pre-processing for the SVM classification.
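The scoring and normalization just described can be sketched as follows, assuming the likelihoods P(O | λij) for the text have already been computed (e.g., as the sum of the final forward variables from the sketch above):

    public class HmmFeatureBuilder {
        // scores[i][j] = P(O | lambda_ij); returns the index of the label li
        // whose group contains the highest-likelihood model.
        public static int bestLabel(double[][] scores) {
            int best = 0;
            double bestP = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < scores.length; i++)
                for (double p : scores[i])
                    if (p > bestP) { bestP = p; best = i; }
            return best;
        }

        // L2-normalizes the feature vector before it is handed to the SVM.
        public static double[] normalize(double[] g) {
            double norm = 0;
            for (double v : g) norm += v * v;
            norm = Math.sqrt(norm);
            double[] out = new double[g.length];
            for (int i = 0; i < g.length; i++) out[i] = (norm == 0) ? 0 : g[i] / norm;
            return out;
        }
    }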
The above section gave a clear idea of how the HMM works as a feature extractor. Next comes the learning of the parameters λ of an HMM from a set O = {O^1, O^2, …, O^K} of example sequences [41]. The Baum-Welch algorithm solves this problem, estimating the parameters with the following formulae:

aij = [ Σ(k=1..K) Σ(t=1..Tk−1) ξt^k(i, j) ] / [ Σ(k=1..K) Σ(t=1..Tk−1) γt^k(i) ]   (1)

bj(vn) = [ Σ(k=1..K) Σ(t: Ot^k = vn) γt^k(j) ] / [ Σ(k=1..K) Σ(t=1..Tk) γt^k(j) ]   (2)

where ξt^k(i, j) and γt^k(i) are the joint-event and state variables associated with the k-th observation sequence, respectively [34]. The feature extraction using the HMM is depicted in Figure 2.

Figure 2. HMM Based Feature Extraction

4.3 Support Vector Machine
The Support Vector Machine (SVM) is a classification technique that was first applied to text categorization by Joachims [1], [9]. It is a powerful supervised learning paradigm based on the structural risk minimization principle. During training, this algorithm constructs a hyperplane that maximally separates the positive and negative instances in the training set. Classification of new instances is then performed by determining which side of the hyperplane they fall on [16].

Most of the previous studies that apply SVM to text categorization use all the words in the document collection without any attempt to identify the important keywords [17]. On the other hand, there are various remarkable studies on keyword selection for text categorization in the literature [23]. As stated above, these studies mainly focus on keyword selection metrics and employ either a corpus-based or a class-based keyword selection approach; they do not use standard datasets and mostly lack a time-complexity analysis of the proposed methods [32], [39]. In addition, most of these studies do not use SVM as the classification algorithm.


For instance, Yang and Liu [12] and Yang and Pedersen [21] use kNN, and Mladenic and Grobelnik [22] use Naive Bayes [31] in their studies on keyword selection metrics. Later studies reveal that SVM performs consistently better than these classification algorithms.

Assume a group of examples {(x1, y1), (x2, y2), …, (xk, yk)}, where xk ∈ R^n and yk ∈ {−1, +1}. We consider decision functions of the form sgn((w · x) + b), where (w · x) denotes the inner product of w and x. A decision function f(w, b) should therefore be found with the property

yi((w · xi) + b) ≥ 1, i = 1, …, k.   (3)

But in many cases the separating hyperplane does not exist. To allow for the possibility of violating Equation (3), slack variables

ξi ≥ 0, i = 1, …, k   (4)

are applied, subject to yi((w · xi) + b) ≥ 1 − ξi, i = 1, …, k.

The resulting problem — minimizing (1/2)||w||² + C Σ(i=1..k) ξi under these constraints — is a constrained quadratic programming (QP) problem, which can be formulated as the following convex QP problem:

Maximize Σ(i=1..k) αi − (1/2) Σ(i=1..k) Σ(j=1..k) αi αj yi yj (xi · xj)   (5)

subject to Σ(i=1..k) yi αi = 0 and 0 ≤ αi ≤ C (i = 1, …, k),

where the αi are Lagrange multipliers and C is a parameter that assigns a penalty cost to the misclassification of samples. Solving the above QP problem gives rise to a decision function of the form

f(x) = sgn( Σ(i=1..k) yi αi (xi · x) + b ),   (6)

where b is a bias term. Only a small fraction of the coefficients αi are nonzero; the corresponding input vectors are known as support vectors, and they fully define the decision function [5], [6]. Thus, the above decision function is expressed through inner products of the data. This observation leads to the generalization to the nonlinear case, which is achieved by mapping the problem data to a higher-dimensional feature space H through a transformation of the form xi · xj → Φ(xi) · Φ(xj); the mapping is implicitly defined through a symmetric positive definite kernel function K(xi, xj) = Φ(xi) · Φ(xj). The decision function can then be rewritten as

f(x) = sgn( Σ(i=1..k) yi αi K(xi, x) + b ).   (7)

A training algorithm for the multi-class SVM can be described as the task of constructing a number of binary SVMs: one classifier Cij for every pair of distinct classes i and j [35]. Each classifier Cij is trained with the samples of the i-th class carrying positive labels and the samples of the j-th class carrying negative labels. The decision function used in this classification is

fij(g̃) = sgn( Σn yn^(li lj) αn^(li lj) (g̃n^(li lj) · g̃) + b^(li lj) ),  i ≠ j; i, j = 1, …, M,   (8)

where n runs over the training samples of the i-th and j-th classes. The SVM method described above is thus a one-against-one method [2]. Given an unknown sample, if the decision function predicts that the sample belongs to class i, the classifier Cij casts one vote for that class; otherwise the vote goes to class j. When the votes from all the binary classifiers have been collected, the unknown sample is assigned to the class with the highest number of votes.
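A sketch of this one-against-one voting, assuming the pairwise decision values fij of Equation (8) have already been evaluated for the unknown sample:

    public class OneVsOneVoter {
        // decision[i][j] holds fij for i < j: a positive value is a vote for
        // class i, a non-positive value a vote for class j.
        public static int classify(double[][] decision, int numClasses) {
            int[] votes = new int[numClasses];
            for (int i = 0; i < numClasses; i++)
                for (int j = i + 1; j < numClasses; j++)
                    votes[decision[i][j] > 0 ? i : j]++;
            // The sample is assigned to the class with the highest vote count.
            int best = 0;
            for (int c = 1; c < numClasses; c++)
                if (votes[c] > votes[best]) best = c;
            return best;
        }
    }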
5. EXPERIMENTAL RESULTS
To evaluate the effectiveness of the new classification method proposed in this paper, we chose a text set consisting of 1200 news items taken from The Hindu [42], The New Indian Express [43], Business Line [44], The Economic Times [45], and Times of India [46], distributed among three major categories: sports, finance, and politics. In our experiment, 800 texts are used as the training set and the remaining 400 texts as the testing set. The distribution of the test news from the various news sites is given in Table 1.

Table 1. Test news distribution among various classes

Class      The Hindu   The New Indian Express   Business Line   The Economic Times   Times of India   Total
Sports         58                30                    0                  0                 18          106
Finance        30                22                   40                 45                 27          164
Politics       48                31                    5                  5                 41          130
Total         136                83                   45                 50                 86          400
HMMs are trained to extract features from the texts of every class, and multi-class SVMs are trained to find the separating decision hyperplane that maximizes the margin between the classified categories. Out of the 106 sports news inputs, 98 were classified correctly. Among the 164 finance news inputs, 158 were classified correctly, and out of the 130 politics test news items, 118 were classified as politics news.

Upon analyzing the misclassified news, we found some ambiguity in the text features of the news inputs. For instance, a report on the terror attack on Sri Lankan cricket players at Lahore contains features that can mislead our classifier into labelling the news as sports news, even though it belongs to another (international) category. The classification accuracies of the proposed system for the categories sports, finance, and politics are 92.45%, 96.34%, and 90.76% respectively. We compared the classification accuracy of this method with that of kNN and SVM, and the test results are shown in Table 2.
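As a quick check, these percentages follow directly from the counts above: sports 98/106 ≈ 92.45%, finance 158/164 ≈ 96.34%, and politics 118/130 ≈ 90.76%.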

Table 2. News classification accuracy of three methods

Category    kNN (%)   SVM (%)   HMM-SVM (%)
Sports        83.25     87.67       92.45
Finance       80.22     82.57       96.34
Politics      82.26     86.55       90.76

The proposed method's classification accuracy is plotted against that of the kNN and SVM classifiers in Figure 3.

[Figure 3: bar chart "News classification of various methods" comparing the accuracy (%) of kNN, SVM, and HMM-SVM on the sports, finance, and politics categories.]
Figure 3. Graphical representation of the classification accuracy of the various methods

6. CONCLUSION
The intelligent news classifier was developed and tested with online news from the web for the categories sports, finance, and politics. The novel approach of combining two powerful algorithms, the Hidden Markov Model and the Support Vector Machine, in the online news classification domain provides extremely good results compared with existing methodologies. By introducing several preprocessing techniques and applying filters, we reduced the noise to a great extent, which in turn improved the classification accuracy. Preprocessing of the training data set also significantly reduced the training time. The experimental results show the performance of this new approach compared with the existing techniques.

7. REFERENCES
[1] T. Joachims, Learning to Classify Text Using Support Vector Machines, Kluwer, 2002.
[2] R. Yan, Y. Liu, A. Hauptmann, "On Predicting Rare Classes with SVM Ensembles in Scene Classification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 03), April 6-10, 2003.
[3] D. Tao, X. Tang, X. Li, and X. Wu, "Asymmetric Bagging and Random Subspacing for Support Vector Machines-based Relevance Feedback in Image Retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[4] Lei Tang, Huan Liu, "Bias Analysis in Text Classification for Highly Skewed Data," Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), 2005, pp. 781-784.
[5] J. Brank, M. Grobelnik, N. Milic-Frayling, D. Mladenic, "Training Text Classifiers with SVM on Very Few Positive Examples," Microsoft Research Technical Report MSR-TR-2003-34, 2003.
[6] D. Lin and P. Pantel, "Discovery of Inference Rules for Question Answering," Natural Language Engineering 7(4), 2001, pp. 343-360.
[7] D. Lin and P. Pantel, "Induction of Semantic Classes from Natural Language Text," Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2001, pp. 317-322.
[8] E. Voorhees, "Query Expansion Using Lexical-Semantic Relations," Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 61-69.
[9] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," European Conference on Machine Learning (ECML), 1998.
[10] L. Ozgur, T. Gungor, F. Gurgen, "Adaptive Anti-Spam Filtering for Agglutinative Languages: A Special Case for Turkish," Pattern Recognition Letters 25(16), 2004, pp. 1819-1831.
[11] A. McCallum, K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification," in M. Sahami (Ed.), Proc. of the AAAI Workshop on Learning for Text Categorization, Madison, WI, 1998, pp. 41-48.
[12] Y. Yang, X. Liu, "A Re-examination of Text Categorization Methods," Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, US, 1999.
[13] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys 34(1), 2002, pp. 1-47.
[14] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research 3, 2003, pp. 1289-1305.
[15] A. Ozgur, "Supervised and Unsupervised Machine Learning Techniques for Text Document Categorization," Master's Thesis, Bogazici University, Turkey, 2004.
[16] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery 2(2), 1998, pp. 121-167.

[17] T. Joachims, "Making Large-Scale SVM Learning Practical," in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
[18] S.-H. Lin, C.-S. Shih, M. C. Chen, J.-M. Ho, "Extracting Classification Knowledge of Internet Documents with Mining Term Associations: A Semantic Approach," Proc. of ACM SIGIR, Melbourne, Australia, 1998, pp. 241-249.
[19] A. P. Azcarraga, T. Yap, T. S. Chua, "Comparing Keyword Extraction Techniques for WEBSOM Text Archives," International Journal of Artificial Intelligence Tools 11(2), 2002.
[20] A. Aizawa, "Linguistic Techniques to Improve the Performance of Automatic Text Categorization," Proceedings of the 6th Natural Language Processing Pacific Rim Symposium, Tokyo, JP, 2001, pp. 307-314.
[21] Y. Yang, J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 412-420.
[22] D. Mladenic, M. Grobelnik, "Feature Selection for Unbalanced Class Distribution and Naive Bayes," Proceedings of the 16th International Conference on Machine Learning, 1999, pp. 258-267.
[23] G. Salton, C. Yang, A. Wong, "A Vector-Space Model for Automatic Indexing," Communications of the ACM 18(11), 1975, pp. 613-620.
[24] ftp://ftp.cs.cornell.edu/pub/smart/ (2004).
[25] M. F. Porter, "An Algorithm for Suffix Stripping," Program 14(3), 1980, pp. 130-137.
[26] G. Salton, C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval," Information Processing and Management 24(5), 1988, pp. 513-523.
[27] D. D. Lewis, Reuters-21578 Document Corpus V1.0, http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
[28] P. Kroha, R. Baeza-Yates, "A Case Study: News Classification Based on Term Frequency," Proceedings of the Sixteenth International Workshop on Database and Expert Systems Applications, 2005.
[29] Lin Lv, Yu-Shu Liu, "Research of English Text Classification Methods Based on Semantic Meaning," ITI 3rd International Conference on Information and Communications Technology: Enabling Technologies for the New Knowledge Society, 2005.
[30] Md. Rafiqul Islam, Md. Rakibul Islam, "An Effective Term Weighting Method Using Random Walk Model for Text Classification," 11th International Conference on Computer and Information Technology (ICCIT 2008), 24-27 Dec. 2008.
[31] Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, Sung Hyon Myaeng, "Some Effective Techniques for Naive Bayes Text Classification," IEEE Transactions on Knowledge and Data Engineering 18(11), Nov. 2006.
[32] Miao Zhang, De-xian Zhang, "Trained SVMs Based Rules Extraction Method for Text Classification," IEEE International Symposium on IT in Medicine and Education (ITME 2008), 12-14 Dec. 2008.
[33] S. Agarwal, S. Godbole, D. Punjani, Shourya Roy, "How Much Noise Is Too Much: A Study in Automatic Text Classification," Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007.
[34] M. Makrehchi, M. S. Kamel, "Combining Feature Ranking for Text Classification," IEEE International Conference on Systems, Man and Cybernetics (ISIC), 7-10 Oct. 2007.
[35] Guifa Teng, Yihong Liu, Jianbin Ma, Fang Wang, Huiting Yao, "Improved Algorithm for Text Classification Based on TSVM," First International Conference on Innovative Computing, Information and Control (ICICIC '06), Vol. 2, Aug. 30 - Sept. 1, 2006.
[36] Hui He, Bo Chen, Jun Guo, "Semi-supervised Chinese Compound Word Extraction Based on HMM," 7th World Congress on Intelligent Control and Automation (WCICA 2008), 25-27 June 2008.
[37] Wei Hu, Dong-Mo Zhang, Huan-Ye Sheng, "Vague Events-based Chinese Web News Classification," Proceedings of the 2004 International Conference on Machine Learning and Cybernetics, 2004.
[38] P. Kroha, R. Baeza-Yates, "A Case Study: News Classification Based on Term Frequency," Proceedings of the Sixteenth International Workshop on Database and Expert Systems Applications, 26 Aug. 2005.
[39] Proceedings of the ACM First Ph.D. Workshop in Information and Knowledge Management, Lisbon, Portugal, 2007.
[40] Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Qin-Bao Song, "A New Text Feature Extraction Model and Its Application in Document Copy Detection," International Conference on Machine Learning and Cybernetics, 2003.
[41] "Feature Selection and Feature Extraction for Text Categorization," Proceedings of the Workshop on Speech and Natural Language, Harriman, New York, Human Language Technology Conference archive.
[42] www.hinduonnet.com
[43] www.expressbuzz.com
[44] www.thehindubusinessline.com
[45] http://economictimes.indiatimes.com
[46] http://timesofindia.indiatimes.com