Learning To Classify Documents According To Genre: Aidan Finn and Nicholas Kushmerick
Abstract
Current document retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis - that is, the ability to distinguish documents according to style - would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer - genre classifiers should be reusable across multiple topics - which doesn't arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple topic domains. We also show how different feature-sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.
1 Introduction
There is a vast amount of information available to the casual user today, mainly due to the proliferation of the world wide web. However, it has become difficult to find the information that is most appropriate to a given query. While users can usually find relevant information, it is increasingly difficult to isolate information that is suitable in terms of style or genre. Current search services take a one-size-fits-all approach, taking little account of the individual user's needs and preferences. These techniques succeed in identifying relevant documents, but the large number of documents relevant to a given query can make it difficult to isolate those documents that are most relevant to that query. Achieving high recall while maintaining precision is very challenging. The huge volume of information available means that new techniques are needed to filter the relevant documents and identify the information that best satisfies a user's information need.

We explore the use of genre to address this issue. By genre we loosely mean the style of text in the document. A genre class is a class of documents that are of a similar type. This classification is based not on the topic of the document, but rather on the kind of text used. We have identified automatic genre analysis as an additional tool that can complement existing techniques and improve the results returned to a user. Genre information could be used to filter or re-rank documents deemed relevant. The relevance of a particular document to a given query is dependent on the particular user issuing the query. We believe that the genre or style of text in a document can provide valuable additional information when determining which documents are most relevant to a particular user's query.

Machine learning has been widely used to categorize documents according to topic. In automatic text classification, a machine learning algorithm is given a set of examples of documents of different topics and it uses these examples to learn to distinguish documents. We consider the use of machine learning techniques to automatically categorize documents according to genre. The ability to identify the style of text used in a document would be a valuable service in any text retrieval system.

For example, consider a query about chaos theory. Different users will require documents which assume different levels of expertise, depending on the user's technical background. It would be useful to be able to rank documents according to the level of technical detail with which they present their subject. Current information retrieval systems would be greatly enhanced by the ability to filter documents according to their genre class. A high school student may require documents that are introductory or tutorial in style, while a college professor may require scholarly research documents. As another example, consider news filtering according to the topic of the article. Such a service would be improved by the ability to filter the news articles according to different genre classes.
For example, consider a financial analyst who tracks daily news about companies in which she is interested. It would be useful to be able to further classify these documents as being subjective or objective. One class of documents would present the latest news about the various companies of interest, while the other class would contain the opinions of various columnists and analysts about these companies. Depending on circumstances, the user may require documents of one class or the other.

Another genre classification with useful applications is identifying whether a document describes something in a positive or negative way. This could be used to improve a recommender system. Products could be recommended on the basis that they were given a positive review by a reviewer with similar interests to the target user. Another application of review classification is filtering of newswire articles for financial analysis. Financial analysts must quickly digest large amounts of information when making investment decisions. A delay of a few seconds in identifying important information can result in significant gains or losses. The ability to automatically identify whether news about a company is positive or negative would be a valuable service in such a situation [10].

The ability to filter documents according to the level of technical information presented and the readability of the document would enable a system to personalize documents retrieved according to the user's educational background. With a suitable set of genre classes, a system with a dual category structure that allowed users to browse documents according to both topic and genre would be useful. Genre analysis can facilitate improved personalization by recommending documents that are written in a style that the user finds interesting or a style that is appropriate to the user's needs. We consider genre to be complementary to topic as a method of recommendation. The two used in conjunction with each other can improve the quality of a user's recommendations.

In this article we make the following contributions:
- We investigate the feasibility of genre classification using machine learning, that is, whether machine learning can successfully be applied to the task of genre classification.
- We investigate how well different feature-sets perform on the task of genre classification. Using two sample genre classification tasks, we perform experiments using three different feature-sets and investigate which features satisfy the criteria for building good genre classifiers.
- We investigate the issues involved in building genre classifiers with good domain transfer. The task of genre classification requires additional methods of evaluation. We introduce the idea of domain transfer as an indication of the performance of a genre classifier across multiple topic domains, and we evaluate each of the feature-sets for their ability to produce classifiers with good domain transfer.
- We investigate how active learning techniques can be applied to build classifiers that perform well with small amounts of training data.
- We investigate methods of combining multiple feature-sets to improve classifier performance.
2 Genre Classification
In our introduction we gave a general outline of what we mean by genre. Here we define our interpretation in more detail, give several examples and compare our definition with previous definitions from related research.
work 2: a style of expressing yourself in writing 3: a class of artistic endeavor having a characteristic form or technique.

Swales [16] gives a working definition of genre. A genre is defined as a class of communicative events where there is some shared set of communicative purposes. This is a loose definition and any particular instance of a genre may vary in how closely it matches the definition. However, instances of a genre will have some similarity in form or function. Karlgren [5] distinguishes between a style and a genre. A style is a consistent and distinguishable tendency to make certain linguistic choices. A genre is a grouping of documents that are stylistically consistent and intuitive to accomplished readers of the communication channel in question.

From the different definitions we see that there is no definitive agreement on what is meant by genre. However, the common thread among these definitions is that genre relates to style. The genre of a document reflects a certain style rather than being related to the content. In general this is what we mean when we refer to the genre of a document: the genre describes something about what kind of document it is rather than what topic the document is about.

Genre is often regarded as orthogonal to topic. Documents that are about the same topic can be from different genres. Similarly, documents from the same genre can be about different topics. Thus we must separate the identification of the topic and genre of a document and try to build classifiers that are topic-independent. This contrasts with the aim of other text classification tasks, so the standard methods of evaluating text classifiers are not completely appropriate. This suggests the notion of domain transfer - whether genre classifiers trained on documents about one topic can successfully be applied to documents about other topics.

We explicitly distinguish between the topic and style of the document. While assuming that genres are stylistically different, we investigate the effect of topic on our ability to distinguish genres. When we evaluate our genre classifiers, we measure how well they perform across multiple topic domains.

In order for genre classification techniques to be generally useful, it must be easy to build genre classifiers. There are two aspects to this. The first is that of domain transfer: classifiers should be generally applicable across multiple topics. The second is that of learning with small amounts of training data. When building genre classifiers, we want to achieve good performance with a small number of examples of the genre class.

Genres depend on context, and whether a particular genre class is useful depends on how well it distinguishes documents from the user's point of view. Therefore genres should be defined with some useful user-function in mind. In the context of the Web, where most searches are based on the content of the document, useful genre classes are those that allow a user to usefully distinguish between documents about similar topics.

To summarize, we view a genre as a class of documents that arises naturally from the study of the language style and text used in the document collection. Genre is an abstraction based on a natural grouping of documents written in a similar style and is orthogonal to topic. It refers to the style of text used in the document. A genre class is a set of documents written in a similar style which serves some useful discriminatory function for users of the document collection.
Football
  Fact: Liverpool have revealed they have agreed a fee with Leeds United for striker Robbie Fowler - just hours after caretaker boss Phil Thompson had said that contract talks with the player were imminent.
  Opinion: The departure of Robbie Fowler from Liverpool saddens me but does not surprise me. What did come as a shock, though, was that the club should agree terms with Leeds, one of their chief rivals for the Championship.
Politics
  Fact: Al Gore picked up votes Thursday in Broward County as election officials spent Thanksgiving weekend reviewing questionable presidential ballots.
  Opinion: Democrats are desperate and afraid. The reality that their nominee for President has a compulsive tendency to make things up to make himself look good is sinking in.
Finance
  Fact: In a move that sent Enron shares higher after days of double-digit declines, Dynegy confirmed Tuesday that it is in talks to renegotiate its $9 billion deal to buy its rival.
  Opinion: The collapse of Enron is hard to believe, and even harder to understand. But in retrospect, there are some valuable lessons in the whole mess.
Table 1: Examples of objective and subjective articles from three topic domains

Movie
  Positive: Almost Famous: Cameron Crowe's first film since Jerry Maguire is so engaging, entertaining and authentic that it's destined to become a rock-era classic. Set in 1973, this slightly fictionalized, semi-autobiographical, coming-of-age story revolves around a babyfaced 15 year old prodigy whose intelligence and enthusiasm land him an assignment from Rolling Stone magazine to interview Stillwater, an up-and-coming band.
  Negative: Vanilla Sky: Presumably Cameron Crowe and Tom Cruise have some admiration for Abre Los Ojos the 1998 Spanish thriller from Alejandro Amenabar; why else would they have chosen to do an English-language remake? Vanilla Sky, however shows that respect for one's source material isn't enough. It's a misbegotten venture that transforms a flawed but intriguing original into an elephantine, pretentious mess.
Restaurant
  Positive: Though the New American menu at this neighbourhood treasure near Capitol Hill is ever changing, it's always beautifully conceived and prepared and based on mostly organic ingredients; the bistro dishes, paired with a fabulous, descriptive wine list, are served in an offbeat atmosphere.
  Negative: Hidden in the back of a shopping mall near Emory, this Chinese eatery is so isolated that diners sometimes feel as if they're having a private meal out; the decor isn't much to look at and the food's nothing special but it's decent.
Table 2: Examples of positive and negative reviews from two topic domains
Figure 1: Genre classification for domain transfer

it is to be used in a new topic domain, the amount of work required to maintain it will be considerable and in a high volume digital library scenario could be prohibitive.
Stamatatos et al. [15] recognize the need for classifiers that can easily transfer to new topic domains, without explicitly mentioning domain transfer, but they do not elaborate on how to evaluate transfer and do not perform any experiments to measure the performance of their classifier when it is transferred to new topic domains. Their feature-set is the most frequently occurring words of the entire written language, and they show that the frequency of occurrence of the most frequent punctuation marks contains very useful stylistic information that can enhance the performance of an automatic text genre classifier. This approach is domain and language independent and requires minimal computation. This work is closely related to ours: their definition of text genre is similar to ours, two of the genre classes they identify are similar to our subjectivity classification task, and the features they use, namely stop-words and punctuation, are similar to our text statistics feature-set.

Kessler et al. [8] argue that genre detection based on surface cues is as successful as detection based on deeper structural properties. Argamon et al. [1] consider two types of features: lexical and pseudo-syntactic. They compare the performance of function words against part-of-speech trigrams for distinguishing between different sets of news articles.

Roussinov et al. [14] view genre as a group of documents with similar form, topic or purpose, a distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form. This is a more general view of genre where genre is a grouping of similar documents. Some genres are defined in terms of purpose or function, others in terms of physical form, while most documents combine the two.
They attempt to identify genres that web users frequently face and propose a collection of genres that are better suited for certain types of information need. To this end, they performed a user survey to see 1) what is the purpose for which users search the web, and 2) whether there was a relation between the purpose of a respondent's search and the genre of document retrieved. This results in a proposed set of genres, along with a set of features for each genre and a user interface for genre based searching.

Dewdney et al. [3] take the view that the genre of a document is the format style. Genre is defined as a label which denotes a set of conventions in the way in which information is presented. The conventions cover both formatting and the style of language used. They use two feature-sets: a set based on words (traditional bag-of-words) and a set of presentation features which represent stylistic information about the document. The presentation features do consistently better than the word frequency features and combining the feature-sets gives a slight improvement. They conclude that linguistic and format features alone can be used successfully for sorting documents into different genres.

Rauber and Muller-Koller [13] argue that in a traditional library, non content-based information such as the age of a document and whether it looks frequently used are important distinguishing features, and present a method of automatic analysis based on various surface level features of the document. The approach uses a self-organizing map (SOM) [9] to cluster the documents according to structural and stylistic similarities. This information is then used to graphically represent documents. In this approach the genres are identified from clusters of documents that occur in the SOM rather than being defined in advance.

Karlgren [4, 6, 7] has done several experiments in genre classification. In [4] he shows that the texts that were judged relevant to a set of TREC queries differ systematically (in terms of style) from the texts that were not relevant. In [6], Karlgren et al. use topical clustering in conjunction with stylistics based genre prediction to build an interactive information retrieval engine and to facilitate multi-dimensional presentation of search results. They built a genre palette by interviewing users and identifying several genre classes that are useful for web filtering. The system was evaluated by users given particular search tasks. The subjects did not do well on the search tasks, but all but one reported that they liked the genre enhanced search interface. Subjects used the genres in the search interface to filter the search results. The search interface described is an example of how genre classification can usefully aid information retrieval.
interested in identifying attributes which perform well on the genre classification task, can be easily extracted automatically and are useful across multiple topics. We use C4.5 [12] as our main learning algorithm. C4.5 is a machine learning algorithm that induces a decision tree from labeled examples and can easily be converted to a set of rules for a human to analyze. We identify three different sets of features and investigate the utility of each of these for genre classification. Furthermore we attempt to identify the features which will lead to classifiers that perform well across multiple topic domains and can easily be built automatically.

We use two sample genre tasks to test the utility of three sets of features for the purpose of automatic genre classification. We emphasize the ability to transfer to new topic domains when building our classifiers and we evaluate different feature-sets for performance across multiple topic domains.

In addition to building classifiers that will transfer easily to new domains, we wish to minimize the effort involved in building a genre classifier. We wish to achieve good performance, that is, prediction accuracy as a function of amount of training data, with a minimum amount of labeled data. To this end we examine the learning rates of our classifiers and investigate methods of improving this learning rate using active learning techniques.

The three feature-sets investigated can be thought of as three independent views of the dataset. We investigate methods of combining the models built using each feature-set to improve classifier performance.
4 Features
We have explored three different ways to encode a document as a vector of features.
4.1 Bag-of-words
The first approach represented each document as a bag-of-words (BOW), a standard approach in text classification. A document is encoded as a feature-vector, with each element in the vector indicating the presence or absence of a word in the document. We wish to determine how well a standard keyword based learner performs on this task. This approach led to feature-vectors that are large and sparse. We used stemming [11] and stop-word removal to reduce the size of the feature vector for our document collection.

This approach to document representation works well for standard text classification where the target of classification is the topic of the document. In the case of genre classification however, the target concept is often independent of the topic of the document, so this approach may not perform as well. It is not obvious whether certain keywords would be indicative of the genre of the document. We are interested in investigating how well this standard text classification approach works on the genre classification tasks. We expect that a classifier built using this feature-set may perform well in a single topic domain, but not very well when domain transfer is evaluated.

By topic domain we mean a group of documents that can be regarded as being about the same general subject or topic. For example, for the subjectivity classification task, we have three topic domains: football, politics and finance. For the review classification task we have two topic domains: restaurant reviews and movie reviews. The reason we identify different topic domains is that a text genre class may occur across multiple topic domains. We wish to evaluate the domain transfer of a genre classifier. For example, if a classifier is trained for the subjectivity classification task using documents from the football domain, how well does it perform when this classifier is transferred to the new domain of politics?

It is common in text classification, where the aim is to classify documents by content, to use a binary representation for the feature vector rather than encoding the frequencies of the words' occurrences. It is also common to filter out commonly occurring words as they do not usefully distinguish between topics. We are interested in measuring domain transfer so we choose the binary vector representation.
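The binary bag-of-words encoding described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the tiny stop-word list and the function names are illustrative assumptions, and stemming [11] is omitted for brevity.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "but"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop-words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_vocabulary(documents):
    """Sorted list of every distinct non-stop-word in the collection."""
    return sorted({w for doc in documents for w in tokenize(doc)})

def binary_bow(text, vocabulary):
    """1 if the vocabulary word occurs in the text, 0 otherwise (no counts)."""
    present = set(tokenize(text))
    return [1 if w in present else 0 for w in vocabulary]

# Toy documents, invented for illustration.
docs = ["The striker agreed a fee", "The fee saddens me"]
vocab = build_vocabulary(docs)
vectors = [binary_bow(d, vocab) for d in docs]
```

Because each vector position corresponds to one vocabulary word, the vectors grow with the collection and are mostly zeros, which is the size and sparsity issue noted above.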
5 Experiments
We have evaluated the three feature-sets using two real-world genre classification tasks.
Description
Coordinating conjunction
Cardinal number
Determiner
Existential there
Foreign word
Preposition or subordinating conjunction
Adjective
Adjective, comparative
Adjective, superlative
List item marker
Modal
Noun, singular or mass
Noun, plural
Proper noun, singular
Proper noun, plural
Pre-determiner
Possessive ending
Personal pronoun

Tag    Description
PP$    Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
TO     to
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb
5.1 Evaluation
We evaluate our classifiers using two measures: accuracy of the classifier in a single topic domain and accuracy when trained on one topic domain but tested on another.
5.1.1 Single Domain Accuracy
Single domain accuracy measures the accuracy of the classifier when it is trained and tested on instances from the same topic domain. This measure indicates the classifier's ability to learn the classification task in the topic domain at hand. Accuracy is defined as the percentage of the classifier's predictions that are actually correct as measured against the known classes of the test examples. Accuracy is measured using ten-fold cross-validation.
5.1.2 Domain Transfer Accuracy
Note that single-topic accuracy gives us no indication of how well our genre classifier will perform on documents from other topic domains. We introduce a new evaluation measure, domain transfer, which indicates the classifier's performance on documents from other topic domains. We measure domain transfer in an attempt to measure the classifier's ability to generalize to new domains. For example, a genre classifier built using documents
Document level statistics: sentence length, number of words, word length
Frequency counts of various function words: because been being beneath can can't certainly completely could couldn't did didn't do does doesn't doing don't done downstairs each early enormously entirely every extremely few fully furthermore greatly had hadn't has hasn't haven't having he her herself highly him himself his how however intensely is isn't it its itself large little many may me might mighten mine mostly much mustn't must my nearly our perfectly probably several shall she should shouldn't since some strongly that their them themselves therefore these they this thoroughly those tonight totally us utterly very was wasn't we were weren't what whatever when whenever where wherever whether which whichever while who whoever whom whomever whose why will won't would wouldn't you your
Frequency counts of various punctuation symbols
Figure 3: Text statistic features
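As a rough illustration of the three feature groups listed in Figure 3, the following Python sketch computes document level statistics, function word counts and punctuation counts for one document. The sentence splitter, the six-word subset of function words, the punctuation set and the feature names are all assumptions made for this example; the source does not specify how these quantities were computed.

```python
import re
from collections import Counter

# Illustrative subset of the function words listed in Figure 3.
FUNCTION_WORDS = ["because", "could", "however", "should", "very", "would"]
# Assumed punctuation set; the source does not enumerate the symbols it counts.
PUNCTUATION = [".", ",", "!", "?", ";", ":"]

def text_statistics(text):
    """Return a dict of simple, topic-independent statistics for one document."""
    # Naive sentence split on runs of ., ! or ? (an assumption, not the paper's method).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    features = {
        "num_words": len(words),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }
    for w in FUNCTION_WORDS:
        features["fw_" + w] = counts[w]
    for p in PUNCTUATION:
        features["punct_" + p] = text.count(p)
    return features

stats = text_statistics("This is very good. However, it could be better!")
```

None of these features names a topic-specific keyword, which is why such a representation is a plausible candidate for transfer across topic domains.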
Table 3: Corpus details for the subjectivity classification experiment

about football should be able to recognize documents about politics from the same genre. Domain transfer is essential in a high volume digital library scenario as it may be prohibitively expensive to train a separate genre classifier for every topic domain. We use the domain transfer measure as an indicator of the classifier's generality. It also gives us an indication of how much the genre classification task in question is topic dependent or topic independent. Domain transfer is evaluated by training the classifier on one topic domain and testing it on another topic domain.

In addition to measuring the domain transfer accuracy, we can calculate the domain transfer rate. This measures how much the classifier's performance degrades when the classifier is evaluated on new topic domains. A classifier that performs equally well in the transfer condition as in a single domain would achieve a transfer score of 1. A classifier whose performance degrades when transferred to new topic domains would achieve a transfer score of less than 1.

Consider a classification task consisting of a learning algorithm $C$ and a set of features $F$. Let $D_1, D_2, \ldots, D_n$ be a set of topic domains. Let ${}_{D}A_{D}$ be the performance of $C$ when evaluated using ten-fold cross-validation in domain $D$. Let ${}_{D_1}A_{D_2}$ denote the performance of classification scheme $C$ when trained in domain $D_1$ and tested in domain $D_2$. We will use accuracy as our measure of performance. We define the domain transfer rate for classification scheme $C$ as

$$DT^{C}_{F} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} \frac{{}_{D_i}A_{D_j}}{{}_{D_i}A_{D_i}} \qquad (1)$$
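The domain transfer rate of equation (1) can be sketched in Python as follows. The nested-dictionary layout, the function name and the example accuracies are assumptions for illustration only; they are not figures from the experiments.

```python
def domain_transfer_rate(accuracy):
    """Average, over ordered pairs of distinct domains (i, j), of the ratio
    accuracy[i][j] / accuracy[i][i], where accuracy[i][j] is the accuracy
    obtained by training on domain i and testing on domain j."""
    domains = list(accuracy)
    n = len(domains)
    if n < 2:
        raise ValueError("need at least two topic domains")
    total = 0.0
    for i in domains:
        for j in domains:
            if i != j:
                total += accuracy[i][j] / accuracy[i][i]
    return total / (n * (n - 1))

# Hypothetical accuracies (percent); diagonal entries stand for the
# single domain (cross-validated) accuracy, off-diagonal for transfer.
example = {
    "football": {"football": 80.0, "politics": 60.0},
    "politics": {"football": 72.0, "politics": 90.0},
}
rate = domain_transfer_rate(example)  # (60/80 + 72/90) / 2
```

With these made-up numbers the rate is below 1, reflecting a classifier that loses accuracy when moved to a new topic domain; a classifier whose transfer accuracy equals its single domain accuracy in every pair would score exactly 1.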
We evaluate the quality of a genre classifier using both single domain accuracy and domain transfer accuracy. Ideally we would hope to get high single domain accuracy and a high domain transfer rate (see Figure 4). A classifier with high accuracy and low transfer may be useful in some situations but not in others.
Table 4: Corpus details for the review classification experiment

subjectivity datasets. The reason is that the classification of a document could be extracted automatically using a wrapper for the particular site. For example, most movie reviews come with a recommendation mark. A review that awards a film 4 stars could be considered a positive review, while a review that awards a film 1 star could be considered a negative review. Thus we automatically extract the classification of a particular review, negating the need to manually classify each document.

The movie reviews were downloaded from the Movie Review Query Engine¹. This site is a search engine for movie reviews. It extracts movie reviews from a wide range of sites. If the review contains a mark for the film, the mark is also extracted. We wrote a wrapper to extract a large number of movie reviews and their corresponding marks from this site. The marks from various sites were normalized by converting them to a percentage and then we used documents with high percentages as examples of positive reviews and vice versa. Marks below 41% were considered negative while marks of 100% were considered positive. Reviews with marks in the range 41-99% were ignored as many of them would require a human to label them as positive or negative.

The restaurant reviews were gathered from the Zagat survey site². This is a site that hosts a survey of restaurants from the U.S.A. and Europe. Users of the site submit their comments about a particular restaurant and assign marks in three categories (food, decor and service). The marks for these categories are between 1 and 30 and are the average for all the users that have provided feedback on that particular restaurant. The reviews themselves consist of an amalgamation of different users' comments about the restaurant. We averaged the marks for the three categories to get a mark for each restaurant. Restaurants that got an average mark below 15 were considered negative while those getting marks above 23 were considered positive.
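The automatic labeling rules above can be written down as a short Python sketch. The function names are hypothetical, and returning None for restaurants whose average falls between 15 and 23 is an assumption: the text states only which restaurants were treated as negative or positive, not what happened to the rest.

```python
def label_movie_review(mark_percent):
    """Marks below 41 are negative, marks of 100 are positive, others are skipped."""
    if mark_percent < 41:
        return "negative"
    if mark_percent == 100:
        return "positive"
    return None  # 41-99: would need a human to label, so left out

def label_restaurant_review(food, decor, service):
    """Average the three category marks (each 1-30); below 15 is negative,
    above 23 is positive."""
    average = (food + decor + service) / 3
    if average < 15:
        return "negative"
    if average > 23:
        return "positive"
    return None  # assumed: the middle band is not used as a training example

movie_labels = [label_movie_review(m) for m in (25, 75, 100)]
restaurant_label = label_restaurant_review(25, 24, 26)
```

Deriving labels from site marks in this way avoids manual annotation, at the cost of discarding reviews whose marks are not clearly positive or negative.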
Figure 4: Desirable performance characteristics of a genre classifier (single domain accuracy plotted against domain transfer rate, with regions numbered 1 to 4 in order of desirability)

classifier is trained and tested on documents from the same topic domain. In the domain transfer case, the classifier is trained on documents from one topic domain and tested on documents from another topic domain.

Usually text classification is applied to tasks where topic specific rules are an advantage. This is not the case for genre classification, which must scale to large numbers of topics: topic specific rules reduce the generality of the genre classifier. In addition to evaluating the genre classifier's performance in a single topic domain, we also need to evaluate its performance across multiple topic domains.

Figure 4 shows single domain accuracy plotted against domain transfer rate, with areas of the graph labeled in order of desirability. The most desirable classifiers would be in region 1 of the graph. These classifiers have both high single domain accuracy and high transfer rate. Next most desirable are classifiers occurring in region 2 of the graph. These classifiers have high single domain accuracy but poor domain transfer. Classifiers occurring in regions 3 and 4 are undesirable because they have poor levels of single domain accuracy and even if they have good transfer rates, they are not useful in practice.
Table 5: Single domain accuracy for subjectivity classification

        Movie   Restaurant   Average
BOW     76.8    88.5         82.7
POS     59.6    62.9         61.3
TS      59      94.1         76.6
MVE     74.1    83.4         78.8
Table 6: Single domain accuracy for review classification

average, although the difference between it and the TS feature-set is insignificant. All three feature-sets achieve good accuracy on this classification task, indicating that any of these feature-sets alone is sufficient for building classifiers within a single topic domain. However BOW is the best performing feature-set on this task within topic domains. This indicates that there are keywords within each topic domain that indicate the subjectivity of a document.

Table 6 shows the single domain results for the review classification experiment. In both domains, the BOW approach performs significantly better than the POS approach. On average, the BOW approach achieves accuracy of 82.7%. This is a good level of accuracy for this classification task. The POS approach performs poorly in comparison (61.3% on average). The BOW approach is capable of achieving good levels of performance when attempting to classify reviews as positive or negative in a single topic domain. The POS approach performs poorly on this classification task, even in a single topic domain. The TS approach performs well in the restaurant domain (94.1%) but poorly in the movie domain (59%). Thus while its average performance is good, it does not perform consistently well in each domain.
Table 7: Domain transfer for subjectivity classification

Train   Movie   Rest    Average
Test    Rest    Movie
BOW     40.1    55.5    47.8
POS     44.4    49.8    47.1
TS      50.4    44.3    47.35
MVE     45.3    52.9    49.1

Table 8: Domain transfer for review classification

built using keywords or domain-specific hand-crafted features.

Table 8 shows the domain transfer results for the review classification experiment. On average, each feature-set performs at a similar level, with less than 1% between them. Each feature-set achieves average accuracy of around 47%. This level of performance is no better than that achievable by a simple majority classifier. The single domain experiment showed that BOW can achieve high levels of accuracy on this classification task in a single topic domain. However, the domain transfer experiment shows that the BOW approach fails when transfer is evaluated. The BOW features which indicate a positive movie review are not transferable to the restaurant domain and vice versa. The POS approach fails in both the single domain and domain transfer experiments. We conclude that the POS approach is not suitable for the task of classifying reviews as being either positive or negative. The BOW approach can achieve good performance in a single topic domain but cannot transfer to new topic domains. Even though the traditional means of evaluating a classifier indicate that BOW achieves good performance, our experiments indicate that it performs poorly when our extra domain transfer condition is evaluated.
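As a worked check on the figures reported in Tables 6 and 8 for the review classification task, the following Python sketch recomputes each feature-set's average and the drop from single domain accuracy to domain transfer accuracy. The dictionary and function names are introduced only for illustration; the numbers are copied from the two tables.

```python
# Accuracy (%) per feature-set for review classification.
# SINGLE: (movie, restaurant), from Table 6.
# TRANSFER: (train movie / test restaurant, train restaurant / test movie), from Table 8.
SINGLE = {"BOW": (76.8, 88.5), "POS": (59.6, 62.9), "TS": (59.0, 94.1), "MVE": (74.1, 83.4)}
TRANSFER = {"BOW": (40.1, 55.5), "POS": (44.4, 49.8), "TS": (50.4, 44.3), "MVE": (45.3, 52.9)}


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def transfer_drop(name):
    """Average single domain accuracy minus average domain transfer accuracy."""
    return mean(SINGLE[name]) - mean(TRANSFER[name])


for name in SINGLE:
    print(f"{name}: single {mean(SINGLE[name]):.2f}, "
          f"transfer {mean(TRANSFER[name]):.2f}, drop {transfer_drop(name):.2f}")
```

The BOW feature-set loses the most accuracy when moved to a new domain, even though it is the strongest feature-set within a single domain.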
5.7 Discussion
Our experiments show that it is possible to build genre classifiers that perform well within a single topic domain. However, single domain performance can be deceiving. When we further evaluate the classifiers for domain transfer performance, it becomes clear that good domain transfer is more difficult to achieve.

The review classification task is more difficult than the subjectivity classification task. All feature-sets achieved good single domain accuracy on the latter task, while the POS feature-set also achieved good domain transfer. On the review classification task, the BOW approach achieved good single domain accuracy, but none of the feature-sets achieved good domain transfer.

From examination of the dataset, reviews from the movie domain are easily recognizable by a human reader as being either positive or negative. It is more difficult to discern the category for many of the restaurant reviews. Recall that the reviews were classified automatically, based on scores extracted from the source website. The restaurant reviews consisted of an amalgamation of user comments about a particular restaurant. For many of these reviews it is difficult for a reader to decide whether they are positive or negative. Because they combine different user comments, the style of the restaurant reviews is different from the style of the movie reviews, which are written by individual authors. This may account for some of the poor performance when domain transfer was evaluated for the review classification task.

It is also clear that no one feature-set is suitable for both genre classification tasks. The BOW feature-set performs well in a single topic domain, while the POS feature-set performs best on the subjectivity classification task when we evaluate domain transfer.
classifier when transferred to the movie domain as it seems plausible that movie reviews containing the word "romantic" are more likely to be negative rather than positive.

For the subjectivity classification task, the POS approach generates trees with root nodes DT, RB and RB for the football, politics and finance domains respectively. DT refers to the distribution of determiners (e.g. as, all, any, each, the, these, those). RB refers to adverbs (e.g. maddeningly, swiftly, prominently, predominately). Subjective documents tend to have relatively more determiners and adverbs. On the review classification task, the POS approach failed to accurately discriminate between positive and negative reviews.

The TS approach generates trees with root nodes based on the number of words in the document for the football and politics domains and the distribution of the word "can" for the finance domain. Shorter documents are more likely to be objective. It seems likely that objective documents will often be much shorter than subjective documents as they just report some item of news, without any discussion of the event involved. It is not clear how the distribution of the word "can" is indicative of the subjectivity of a document. On the review classification task, the TS approach did not perform well in the movie domain (59%), but performed surprisingly well in the restaurant domain (94.1%). In this case the root node of the generated tree is the number of long words in the document. Reviews containing a small number of long words are more likely to be negative.
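The kinds of features named at these root nodes can be sketched in Python as follows. This is an illustration under stated assumptions, not the feature extraction used in our experiments: the input is assumed to be already tagged with Penn Treebank style tags (such as DT and RB), and both function names and the `long_word_length` threshold of 7 characters are hypothetical.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Relative frequency of each POS tag in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def text_statistics(words, long_word_length=7):
    """Simple document-level statistics; the long-word threshold is an assumption."""
    return {
        "num_words": len(words),
        "num_long_words": sum(len(w) >= long_word_length for w in words),
    }


# A hand-tagged toy sentence (invented for illustration).
tagged = [("the", "DT"), ("team", "NN"), ("played", "VBD"),
          ("maddeningly", "RB"), ("slowly", "RB")]

dist = pos_distribution(tagged)
stats = text_statistics([word for word, _ in tagged])
print(dist["DT"], dist["RB"], stats)
```

A decision tree learner would then split on values such as `dist["DT"]` or `stats["num_words"]`, which is what a root node like DT or "number of words" corresponds to.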
Our approach differs from these in that we combine models based on our different feature-sets. This multi-view ensemble learning approach builds a model based on each of the three feature-sets. A majority vote is taken to classify a new instance.

The results achieved by the ensemble learner are encouraging. For the subjectivity classification task the results (Table 5) achieved by this approach (MVE) are better than those achieved by any of the individual feature-sets. The domain transfer (Table 7) is almost as good as that achieved by POS, and significantly better than that achieved by the other feature-sets. For the review classification task (Table 6) this approach performs better than POS and TS, but not as good as BOW. In the domain transfer case (Table 8), this approach performs best on average.

This approach to classification exploits the fact that the three different feature-sets do not all make mistakes on the same documents. So a mistake made by the model based on one feature-set can be corrected by the models based on the other feature-sets. This works best in situations where all three feature-sets achieve good performance, such as the subjectivity classification task. When each feature-set performs well, they are more likely to correct each other's mistakes. In cases where some of the feature-sets perform poorly (such as the review classification task), this approach will achieve performance that is proportional to the relative performance of the individual feature-sets. It seems likely that for genre classification tasks where it is not clear which feature-set is most suitable for the task, this approach will increase the likelihood of the classifier performing well.
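The majority vote over the three views can be sketched in a few lines of Python. The stub models below are hypothetical stand-ins for classifiers trained on each feature-set, and the names `majority_vote` and `ensemble_predict` are introduced only for this illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(models, views):
    """Classify one document from its per-feature-set views.

    `models` maps a feature-set name to a predict function;
    `views` maps the same names to that document's features.
    """
    return majority_vote([models[name](views[name]) for name in models])


# Hypothetical stub models: each looks at a single number from its own view.
models = {
    "BOW": lambda v: "subjective" if v > 0 else "objective",
    "POS": lambda v: "subjective" if v > 0 else "objective",
    "TS": lambda v: "subjective" if v > 0 else "objective",
}

# The TS model is wrong here, but the other two views outvote it.
label = ensemble_predict(models, {"BOW": 1, "POS": 1, "TS": -1})
print(label)
```

With three models and two classes there is always a strict majority, so no tie-breaking rule is needed in this setting.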
Applying this approach to our subjectivity classification task gives an improvement in learning rate for all three feature-sets (BOW_al, POS_al, TS_al). For each feature-set, there is little difference between the random and active learning approaches initially. However, as the classification accuracy improves, the active learning approach begins to exhibit a better learning rate than the random approach. This indicates that the active learning approach consistently chooses documents that improve the performance of the classifier.
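The selection step can be sketched as follows, assuming it picks the unlabeled document on which the models built from the different feature-sets disagree most. The disagreement measure used here (one minus the share of models voting for the most common label), the stub models and the function names are all assumptions made for illustration, not the exact procedure used in our experiments.

```python
from collections import Counter


def disagreement(labels):
    """0.0 when all models agree; larger when the votes are split."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1 - top_count / len(labels)


def select_next(unlabeled, models):
    """Return the id of the unlabeled document with the highest disagreement.

    `unlabeled` maps a document id to its per-feature-set views;
    `models` maps a feature-set name to a predict function.
    """
    def score(doc_id):
        views = unlabeled[doc_id]
        return disagreement([models[name](views[name]) for name in models])

    return max(unlabeled, key=score)


# Hypothetical stub models and toy views (a single number per view).
models = {
    "BOW": lambda v: "subjective" if v > 0 else "objective",
    "POS": lambda v: "subjective" if v > 0 else "objective",
    "TS": lambda v: "subjective" if v > 0 else "objective",
}
unlabeled = {
    "d1": {"BOW": 1, "POS": 1, "TS": 1},    # all models agree
    "d2": {"BOW": 1, "POS": -1, "TS": 1},   # one model dissents
}

chosen = select_next(unlabeled, models)
print(chosen)
```

The chosen document would then be labeled and added to the training set before the models are retrained.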
7 Conclusion
In theory, genre and topic are orthogonal. However, our experiments indicate that in practice they partially overlap. It may be possible to automatically identify genre in a topic independent way, but the results of our domain transfer experiments show that the feature-sets we investigate result in models that are partially topic dependent.

From a single topic point of view, our approach was very successful. If we used only the usual methods of evaluation, we would conclude that genre classification is not a difficult task and can easily be achieved using standard machine learning techniques. On the subjectivity classification task, all our feature-sets achieved high accuracy, while on the review classification task a standard bag-of-words approach achieved good accuracy. We have argued that standard methods of evaluation are not sufficient when evaluating genre classifiers and that, in addition, the genre classifier's ability to transfer to new topic domains must also be evaluated. When we evaluate this additional aspect of the genre classifiers, we find that it is difficult to build classifiers that transfer well to new domains.

For the subjectivity classification task we have shown that it is possible to build a genre classifier that can automatically recognize a document as being either subjective or objective. High accuracy in a single topic domain can be achieved using any of the three feature-sets we investigated (BOW, POS or TS) but when domain transfer is measured for this task, the POS feature-set performs best. Overall, the POS feature-set is best for this genre classification task as it performs well both in a single topic domain and when transferred to new topic domains.

The review classification task is more difficult. Good accuracy can be achieved in a single topic domain using the BOW approach. The POS approach is not suitable for this genre classification task. All three feature-sets fail to achieve good domain transfer on this task.
We also investigated methods of combining the predictions of models based on the different feature-sets and show that this improves performance. This approach is perhaps best when approaching a new genre classification problem, where it is not clear which feature-set is most suitable for the task. We also show that the learning rate of the genre classifier can be improved by actively selecting which document to add to the training set. This selection is based on the level of disagreement of models built using each feature-set. These two approaches further facilitate the aim of automating as much as possible the process of building genre classifiers. All three feature-sets can be extracted automatically. The ensemble learning approach can give good performance on the genre classification task and the active learning approach can improve performance on small amounts of training data.
Future work
We identified two sample genre classification tasks. These particular genre classes could be usefully applied to improve existing information retrieval systems. Applications that utilize genre classification to provide noticeable benefits to the end user must be developed to determine whether genre classification can be a useful, practical technique for improving document retrieval systems. In building such systems it will be useful to identify additional genres that can improve a user's ability to filter documents and reduce the number of documents that are potentially relevant to them. An expanded genre taxonomy is needed together with appropriate techniques for automatically identifying genres.

We found that the techniques that were successful on one genre classification task (subjectivity classification) were less successful on another genre classification task (review classification). The ability to achieve good domain transfer is important for genre classifiers. The techniques we used did not provide a complete separation of genre and topic. Further investigation is needed to determine methods of identifying genre in a topic independent way. We also need to refine methods of evaluating domain transfer and determine how to meaningfully compare the performance of different genre classifiers.

Ideally, once a general genre taxonomy is defined we need techniques for automatically constructing genre classifiers within this taxonomy. One would hope that there are general techniques that could be used to build all classifiers for all genres within a taxonomy and that these genre classifiers will transfer easily to new topic domains. However, our experience has shown that this is difficult and methods for achieving these aims need further investigation. Other feature-sets could be generally useful for building genre classifiers. The addition of further feature-sets may also improve the performance of the ensemble learner and active learning approaches.
In general, future work consists of extending the work we have done on two genre classification tasks to a general genre taxonomy. Classifiers built to identify genre classes within this taxonomy should be easy to build and domain independent. The other major area for future work is to implement applications that use genre classification to improve the user's experience.

Acknowledgments

This research was funded by Science Foundation Ireland and the US Office of Naval Research. Thanks to Barry Smyth for his advice and assistance.