Key-Phrase Extraction For Classification
TEI of Athens/Dept. of Informatics, Professor, Aegaleo, Greece
Abstract: In this paper we consider the problem of extracting key-phrases from a bilingual text collection and using them for text classification. A key-phrase could be defined as a sequence of words of a given size, in a given partial order, that occurs within a sentence. We describe an algorithm for the discovery of key-phrases. Then, a framework for handling multilingual texts / documents is described which combines the use of the traditional vector space model with a new similarity measure based on the key-phrases. This framework is used for finding the most similar documents of a training set to any new document and selecting the classes of the similar documents as the most plausible ones for the new document. Some experimental results are also presented.

Introduction

Phrase extraction is the subject of interesting research accompanied by various experimental and operational tools. It is worth mentioning here that these tools are usually oriented towards the extraction of information from Web applications. As an example we can mention the case of the KEA system [1], which implements in Java a rather simple algorithm for extracting phrases from English text.

In the case of the Greek language there is a rich syntactic and inflectional (grammatical) system that implies further difficulties in the extraction process. Hence, the use of a stop-word list, some morphological analysis, stemming, etc. are prerequisites for handling Greek-Latin text. It is also interesting to see the problem of extracting phrases from texts written in other languages with a rich inflectional system [2].

We must also stress the importance of using these extracted phrases as terms characterizing the document, and of their storage and organization as a basis for effective free-text searching.

Key-phrase extraction

Most machine learning and text mining techniques are adapted towards the analysis of text collections. Texts are composed of words or phrases and have an inherent sequential structure. Such a text can be viewed as a sequence of words, stop-words, punctuation marks, parentheses and key-phrases, where each key-phrase has an associated frequency, e.g. the number of occurrences. Let us see a bilingual portion of a discharge letter covering past history and presentation: "71 years old male patient with a history of a AAA (Abdominal Aortic Aneurysm) repair 13 months ago presented with RUQ (Right Upper Quadrant) pain and a palpable mass of two months duration".

Stop-words could be words such as with, a, of, which have no implication in retrieving texts. Key-phrases could be sequences of words such as the following:

  AAA  Abdominal Aortic Aneurysm
  RUQ  Right Upper Quadrant pain

The basic problem in analyzing such a text collection is to find key-phrases, i.e., sequences of words occurring frequently close to each other. For example, a phrase "A (followed by) B (followed by) C (followed by) D", where A, B, C, D are four distinct words, must occur a specific number of times in order to be considered a key-phrase. Note that in the sentence there can be other words occurring between these ones. The user must define how close is close enough by giving the width of the word window within which the key-phrase must occur. The user must also specify how often a key-phrase has to occur to be considered frequent.
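As an illustration of the window and frequency parameters, the following is a minimal sketch (our own, not part of the original system; the greedy matching strategy is an assumption) of counting the occurrences of an ordered word sequence within a word window:

  def occurs_in_window(words, phrase, width):
      # Count occurrences of `phrase` (an ordered tuple of words) in the
      # tokenized sentence `words`, allowing other words in between, as
      # long as the whole match fits inside a window of `width` words.
      count = 0
      for start in range(len(words)):
          if words[start] != phrase[0]:
              continue
          matched = 1
          # Greedily match the remaining phrase words inside the window.
          for j in range(start + 1, min(start + width, len(words))):
              if matched < len(phrase) and words[j] == phrase[matched]:
                  matched += 1
          if matched == len(phrase):
              count += 1
      return count

  sentence = ("71 years old male patient with a history of a AAA "
              "Abdominal Aortic Aneurysm repair").lower().split()
  print(occurs_in_window(sentence, ("abdominal", "aortic", "aneurysm"), 5))  # 1

A phrase would then be frequent if such counts, aggregated over the documents under consideration, reach the user-given threshold.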
Discovery of frequent key-phrases

Mannila et al. [6] describe an algorithm for discovering frequent episodes in a telecommunication network alarm database. The essential points of their method are the following:

a. Input data is a flat sequence of ordered episodes (faults) and is not organized into other structural levels. This fact implies that the situation in our case is different: we have document / text collections where words and phrases compose sentences, sentences compose documents and documents compose the collection.

b. Their main idea is that for every frequent sequence of events all the subsequences are at least equally frequent. Therefore, the construction of candidate sequences of width n+1 (Cn+1) can be based on the frequent sequences of width n (Ln). The implication is that the search space could be reduced.
c. The episodes of a sequence are successive, but in the sequence there can be other events occurring between the episodes.

We think that the choice of sequences of events based on high frequencies is a reliable method for forecasting in general. An adaptation of such a method in a text processing environment can be helpful, especially in the creation of a type-ahead wizard.

However, if key-phrases (sequences of words) are used as indexing terms for information retrieval, it is better to choose phrases that exist in a few texts (if the candidate phrases exist in many texts then they are useless for retrieval purposes) and are quite frequent within these texts. Therefore, it is better to use appropriate measures that prefer / choose such phrases. A simple measure in this category is the following:

  freq(P,D) \times \frac{N}{docfreq(P)}

where freq(P,D) is the frequency of phrase P in document D, docfreq(P) is the number of documents in the collection that include phrase P, and N is the number of documents in the collection.

A variation of the above measure is used in the KEA system [3] to build a prediction model based on a training set of documents. The following equation describes this measure:

  TF \times IDF = \frac{freq(P,D)}{size(D)} \times \left( -\log_2 \frac{docfreq(P)}{N} \right)

where size(D) is the size of document D in words.
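For concreteness, both measures can be computed as follows (a sketch under our own naming; the counts are assumed to be given):

  import math

  def simple_measure(freq_p_d, n_docs, docfreq_p):
      # freq(P,D) x N / docfreq(P): prefers phrases that are frequent
      # within a document but occur in few documents of the collection.
      return freq_p_d * n_docs / docfreq_p

  def tf_idf(freq_p_d, size_d, docfreq_p, n_docs):
      # The KEA-style measure: (freq(P,D) / size(D)) x -log2(docfreq(P) / N).
      return (freq_p_d / size_d) * -math.log2(docfreq_p / n_docs)

  # A phrase occurring 3 times in a 200-word letter and in 2 of 29 documents:
  print(simple_measure(3, 29, 2))   # 43.5
  print(tf_idf(3, 200, 2, 29))      # ~0.058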
If key-phrases are used as features for text / document classification [4, 5] then the frequent key-phrases are inappropriate for such a task. If the choice of key-phrases for text classification is based on measures such as the one used by the KEA system, then there are some potential problems:

1. A candidate key-phrase that exists in many documents of only one class (and not in another class) could be erroneously rejected if the number of documents of this class is greater than the number of documents of other classes.

2. A candidate key-phrase could be erroneously chosen in case it exists in a small subset of texts of a numerous / dense class and all these texts are dedicated to a specific subtopic of the topic of the class.

3. A candidate key-phrase is chosen because it exists only in a few texts in the collection and is quite frequent within these texts, but these few texts that contain the candidate key-phrase are spread over a lot of classes.

As a first conclusion, the choice of the key-phrases must not be based on candidate phrases that are frequent within the whole text collection, or on measures like the one used in the KEA system. Instead of using these phrases, frequent in the whole collection, for text classification, we can use key-phrases which are frequent within the documents of one or few classes but are not frequent in the documents of the rest of the classes in the training set. We also estimate that the selection of key-phrases based on some syntactic structures poses some extra complexity (use of morphological part-of-speech taggers and syntactic analyzers) and works restrictively in the selection of key-phrases for classification.

The above discussion, combined with the analysis of the method in [6], has influenced us in the construction of a new algorithm for key-phrase extraction for classification.

We formalize the problem of phrase extraction for classification in the following way. Given a collection of documents subdivided into classes, a window width and a frequency threshold, find all key-phrases that occur frequently enough in one or few classes but do not occur frequently enough in other classes. We describe an algorithm for solving this task. The algorithm has two alternating phases: building new candidate key-phrases, and evaluating how often these occur in a class of the collection.

The idea of building candidate patterns from smaller ones is incorporated into the algorithm. Such an idea has been profitably used in the discovery of association rules and occurs also in other contexts [6, 7]. Adapting the main ideas discussed in [6], we can claim that the efficiency of our algorithm is based on the following fact: potentially, a very large number of candidate key-phrases has to be checked, and we can reduce the search space by building larger key-phrases from smaller ones. In other words, it is only necessary to test the occurrences of key-phrases all of whose sub key-phrases are frequent.

Algorithm

We give an algorithm for finding key-phrases that occur frequently enough in one or few classes but do not occur frequently enough in many classes:

  1  For every class of the training set do
  2    For every document of the class do
  3      Stemming
  4      Stop-word removal
  5    End {For every document of the class}
  6    Choose the most frequent stems of the class (P0 - parameter)
  7    Form the candidate double-word phrases (C2) from the frequent stems (L1)
  8    Choose the most frequent double-word phrases (L2) (W1 and P1 - parameters)
  9    For x = 3, 4 do
  10     Form the candidate x-width word phrases (Cx) from the frequent
         (x-1)-width word phrases (Lx-1)
  11     Choose the most frequent x-width word phrases (Lx)
         (P2 and W2, P3 and W3 - parameters)
  12   End {For x = 3, 4 do}
  13   Compose an integrated list by joining Lx (for x = 2, 3, 4). This join
       forms the frequent word phrases of the class (LFC)
  14 End {For every class of the training set}
  15 Integrate / join the lists of frequent word phrases of all classes of
     the training set
  16 Reject the frequent word phrases that exist in many classes
     (Pt - parameter). The rest of the frequent word phrases form the
     set of key-phrases or «Authority list»
  17 Form the dictionary of «Terms». It is the list of stems that are
     components of the key-phrases of the «Authority list».

Parameters:
  P0  percentage of texts of the class that must contain a stem,
  W1  width of window that covers 2-word phrases,
  P1  percentage of texts of the class that must contain a 2-word phrase,
  W2  width of window that covers 3-word phrases,
  P2  percentage of texts of the class that must contain a 3-word phrase,
  W3  width of window that covers 4-word phrases,
  P3  percentage of texts of the class that must contain a 4-word phrase,
  Pt  percentage of classes that can contain a key-phrase.
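The pipeline of steps 1-16 could be organized along the following lines. This is a structural sketch under our own assumptions, not the authors' implementation: stem, remove_stopwords and windowed_freq are assumed helpers (windowed_freq in the spirit of occurs_in_window above), build_candidates is sketched after Figure 1, and the P parameters are passed as fractions:

  from collections import Counter

  def frequent_stems(docs, p0):
      # Step 6: stems contained in at least a fraction p0 of the class's documents.
      df = Counter(s for d in docs for s in set(d))
      return {s for s, c in df.items() if c / len(docs) >= p0}

  def extract_authority_list(training_set, params):
      # `training_set` maps a class label to its documents (token lists).
      per_class = {}
      for label, docs in training_set.items():
          docs = [remove_stopwords([stem(w) for w in d]) for d in docs]  # steps 2-5
          lx = {(s,) for s in frequent_stems(docs, params["P0"])}        # step 6
          lfc = set()
          for x in (2, 3, 4):                                            # steps 7-12
              cx = build_candidates(lx, x)                               # building phase
              width, p = params["W%d" % (x - 1)], params["P%d" % (x - 1)]
              lx = {ph for ph in cx
                    if sum(windowed_freq(d, ph, width) > 0
                           for d in docs) / len(docs) >= p}              # recognition phase
              lfc |= lx                                                  # step 13
          per_class[label] = lfc
      # Steps 15-16: reject phrases occurring in too large a share of classes.
      class_freq = Counter(ph for lfc in per_class.values() for ph in lfc)
      return {ph for ph, c in class_freq.items()
              if c / len(per_class) <= params["Pt"]}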
Adopting the idea proposed in Mannila et al., our algorithm works iteratively, alternating between building (steps 7 and 10) and recognition phases (steps 8 and 11). In the building phase of an iteration i, a collection Ci of new candidate key-phrases of distinct words is built, using the information available from smaller frequent key-phrases. Then, these candidate key-phrases are recognized in the documents of the class and their frequencies are computed. The collection Li consists of the frequent key-phrases in Ci. In the next iteration i+1, candidate key-phrases in Ci+1 are built using the information about the frequent key-phrases in Li. The algorithm starts by constructing C1 to contain all key-phrases consisting of single words. At the end of each step, the list of frequent key-phrases of the processed class is built (step 13). At the end, the algorithm composes the Authority list (steps 15 and 16).

Steps 7 and 10 are based on the second (b) essential point of Mannila's paper [6]. The set of candidate 2-word phrases (C2) must contain key-phrases of length 2 (key-phrases including two stems of distinct words). To construct this set, we form the Cartesian product and then remove all the tuples that have the same elements. Figure 1 illustrates an example of the application of step 10 of the algorithm: in this case, the construction of the sixth class of key-phrases (C6) from the frequent 5-word phrases (L5) is depicted.

[Figure 1: construction of C6 based on L5.]
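Steps 7 and 10 could be rendered as follows; a sketch, where the pruning rule is the subsequence property (b) of [6] and the x = 2 branch is the Cartesian-product construction described above:

  from itertools import product

  def build_candidates(frequent, x):
      # Build the candidate x-word phrases Cx from the frequent
      # (x-1)-word phrases L(x-1), kept as tuples of stems.
      if x == 2:
          stems = {p[0] for p in frequent}
          # Cartesian product minus the tuples with two equal elements.
          return {(a, b) for a, b in product(stems, stems) if a != b}
      # Join phrases overlapping in x-2 words, then keep only candidates
      # whose every (x-1)-word subsequence is itself frequent.
      joined = {p + (q[-1],) for p in frequent for q in frequent
                if p[1:] == q[:-1]}
      def subseqs(p):
          return (p[:i] + p[i + 1:] for i in range(len(p)))
      return {p for p in joined if all(s in frequent for s in subseqs(p))}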
Similarity based classification

Karanikolas and Skourlas [5] presented the idea that the classification of new medical documents can be based on their similarity to existing documents (of a "training" set). Such an instance-based learning method assumes that similar documents must be classified in the same category or (in other words) must share the same classification code (e.g. the same ICD code). According to this approach the text collection is divided into a number of classes and each document of each class is characterized by a number of key-phrases. For each document in the collection the existing key-phrases in the document can have a frequency, etc.

The vector space model

In the popular vector space model a data set of n unique terms is specified, called the index terms of the document collection, and every text can be represented by a vector of weights of the terms in the document. In our case we use a set of m key-phrases instead of simple terms and the vector representation of each document can be:

  (kp1, kp2, …, kpm)

where kpj = 1 if the key-phrase j is present in the text, and 0 otherwise.

A query is a new (unclassified) document (text) and can be represented in the same manner. The text and query vectors can be envisioned as an m-dimensional vector space. A vector matching operation, based on the cosine correlation used to measure the cosine of the angle between vectors, can be used to compute the similarity. Hence, the following equation (adapted from Lucarella D., 1988, [8]) gives us a well-known method to measure the similarity of a text Di of the training set against a new text Dnew (or query Q):

  S(D_i, D_{new}) = \frac{\sum_{j=1}^{m} q_j \, kp_{ij}}{L_{D_{new}} \cdot L_{D_i}},
  \quad L_{D_{new}} = \sqrt{\sum_{j=1}^{m} q_j^2},
  \quad L_{D_i} = \sqrt{\sum_{j=1}^{m} kp_{ij}^2}

where m is the number of key-phrases used in the collection, kpij is equal to 1 if the key-phrase j exists in document Di (of the training set) and equal to 0 otherwise, and qj is the weight of key-phrase j in the new document. The following equation can be used to measure the term qj:

  q_j = \log_2 \frac{ClassCount}{ClassFreq_j}

where ClassCount is the number of classes of the training set, and ClassFreqj is the number of classes that include the key-phrase j.
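Under the stated model (binary kp vectors, log2 class-based weights, cosine denominator), the classification step could look like this; the function and field names are our own illustrative assumptions:

  import math

  def q_weight(class_count, class_freq_j):
      # q_j = log2(ClassCount / ClassFreq_j)
      return math.log2(class_count / class_freq_j)

  def similarity(q, kp_i):
      # S(Di, Dnew): q is the weight vector of the new text, kp_i the
      # 0/1 key-phrase vector of training document Di.
      num = sum(qj * kpj for qj, kpj in zip(q, kp_i))
      l_new = math.sqrt(sum(qj * qj for qj in q))
      l_i = math.sqrt(sum(kpj * kpj for kpj in kp_i))
      return num / (l_new * l_i) if l_new and l_i else 0.0

  def most_plausible_classes(q, training_vectors, top_k=5):
      # Rank the training documents and read off the classes of the
      # most similar ones, as in the instance-based approach above.
      ranked = sorted(training_vectors, reverse=True,
                      key=lambda doc: similarity(q, doc["kp"]))
      return [doc["label"] for doc in ranked[:top_k]]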
Experimental results

In the next table (Table 1) we depict the classes of texts of the training set, classified by ICD-9 codes.

Table 1: Training Set

  Class    ICD9 code that            Number of documents
  number   characterizes the class   (discharge letters) in the class
  1        0010                      4
  2        122.8                     4
  3        151                       4
  4        153                       4
  5        153.3                     5
  6        154.1                     4
  7        155.0                     4

First, we applied our algorithm to the training set (29 discharge letters) and the «Set of key-phrases» / «Authority list» was constructed. Then, every text of the training set was submitted as a new text for classification (for assigning the appropriate ICD-9 code). The similarity of the «new» text with all the texts of the training set was calculated using the proposed measure of similarity. In the next table (Table 2) we present the number of documents of the same class as the «new» document. More precisely, we present the number of retrieved documents of the same class as the «new» document that belong to the first five most similar ones and the first ten most similar ones, respectively. It seems that we have promising / encouraging results.
Table 2: Results
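The evaluation just described could be scripted along these lines (a sketch; whether the resubmitted text is excluded from its own ranking is our assumption, and similarity and the vector fields are as in the previous sketch):

  def evaluate(training_vectors):
      # For every training text, resubmitted as a «new» text, count the
      # same-class documents among its 5 and 10 most similar neighbours.
      rows = []
      for i, doc in enumerate(training_vectors):
          others = training_vectors[:i] + training_vectors[i + 1:]
          ranked = sorted(others, reverse=True,
                          key=lambda d: similarity(doc["q"], d["kp"]))
          same5 = sum(d["label"] == doc["label"] for d in ranked[:5])
          same10 = sum(d["label"] == doc["label"] for d in ranked[:10])
          rows.append((doc["id"], same5, same10))
      return rows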
Funds

References

[3] Eibe Frank et al. Domain-Specific Keyphrase Extraction. International Joint Conference on Artificial Intelligence, 1999.

[4] N. Karanikolas and C. Skourlas. Automatic Diagnosis Classification of patient discharge letters. MIE 2002: XVIIth International Congress of the European Federation for Medical Informatics, August 25-29, 2002, Budapest, Hungary.

[5] N. Karanikolas, C. Skourlas, A. Christopoulou and T. Alevizos. Medical Text Classification based on Text Retrieval techniques. MEDINF 2003: 1st International Conference on Medical Informatics & Engineering, October 9-11, 2003, Craiova, Romania.

[6] Heikki Mannila, Hannu Toivonen and A. Inkeri Verkamo. Discovering frequent episodes in sequences. KDD-95: First International Conference on Knowledge Discovery and Data Mining, August 20-21, 1995, Montreal, Canada.

[7] Heikki Mannila and Hannu Toivonen. Discovering generalized episodes using minimal occurrences. KDD-96: Second International Conference on Knowledge Discovery and Data Mining, August 1996, Portland, Oregon. AAAI Press.

[8] Lucarella, D. A document retrieval system based on nearest neighbour searching. Journal of Information Science, 14, 25-33, 1988.