Classification Algorithms - Unit III P3

Content-based Recommendation - Classification algorithms:

- Another way of deciding whether or not a document will be of interest to a user is to
view the problem as a classification task, in which the possible classes are “like”
and “dislike”.
- Once the content-based recommendation task has been formulated as a classification
problem, various standard (supervised) machine learning techniques can, in principle,
be applied such that an intelligent system can automatically decide whether a user will
be interested in a certain document.
- Supervised learning means that the algorithm relies on the existence of training data,
in our case a set of (manually labeled) document-class pairs.

Probabilistic methods:
- The most prominent classification methods developed in early text classification
systems are probabilistic ones.
- These approaches are based on the naive Bayes assumption of conditional
independence (with respect to term occurrences) and have also been successfully
deployed in content-based recommenders.
- The basic formula to compute the posterior probability for document classification is

  P(C = c | v1, ..., vn) ∝ P(C = c) × ∏ P(vi | C = c)

  i.e., the prior probability of class c multiplied by the product of the conditional
  probabilities of the observed features.
- The possible classes are, of course, “like” and “dislike” (named hot and cold in
some articles).
- Documents are represented by Boolean feature vectors that describe whether a
certain term appeared in a document; the feature vectors are limited to the 128 most
informative words.
- Thus, in the model, P(vi |C = c) expresses the probability of term vi appearing in a
document labeled with class c. The conditional probabilities are again estimated by
using the observations in the training data.
- Table depicts a simple example setting. The training data consist of five manually
labeled training documents.
- Document 6 is a still-unlabeled document. The problem is to decide whether the
current user will be interested in it, i.e., whether to recommend the item.
- To determine the correct class, compute the class-conditional probabilities for the
feature vector X of Document 6 again as follows:
P(X|Label=1) = P(recommender=1|Label=1) ×
P(intelligent=1|Label=1) ×
P(learning=0|Label=1) × P(school=0|Label=1)
= 3/3 × 2/3 × 1/3 × 2/3
= 4/27 ≈ 0.148

Classification based on Boolean feature vector


Doc-ID recommender intelligent learning school Label
1 1 1 1 0 1
2 0 0 1 1 0
3 1 1 0 0 1
4 1 0 1 1 1
5 0 0 0 1 0
6 1 1 0 0 ?
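As a sketch, the class-conditional probability for Document 6 can be reproduced in a few lines of plain Python; the data are exactly those from the table, and no external libraries are assumed:

```python
# Boolean feature vectors and labels from the table above
# (columns: recommender, intelligent, learning, school)
docs = [
    ([1, 1, 1, 0], 1),
    ([0, 0, 1, 1], 0),
    ([1, 1, 0, 0], 1),
    ([1, 0, 1, 1], 1),
    ([0, 0, 0, 1], 0),
]

def class_conditional(x, label):
    """P(X | Label) under the naive Bayes conditional independence assumption."""
    class_vectors = [vec for vec, lab in docs if lab == label]
    n = len(class_vectors)
    p = 1.0
    for i, xi in enumerate(x):
        # estimate P(vi = xi | Label) by its relative frequency in the class
        p *= sum(1 for vec in class_vectors if vec[i] == xi) / n
    return p

doc6 = [1, 1, 0, 0]
print(round(class_conditional(doc6, 1), 3))  # 3/3 * 2/3 * 1/3 * 2/3 = 4/27 -> 0.148
```

For Label = 0 the same function yields 0, because no negative training document contains the term “recommender”; this is exactly the zero-probability problem that smoothing addresses.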

- The same can be done for the case Label = 0, and we see in the simple example that it
is more probable that the user is more interested in documents (for instance, web
pages) about intelligent recommender systems than in documents about learning in
school.
- In real applications some sort of smoothing must be done for sparse datasets such that
individual components of the calculation do not zero out the probability values.
- Of course, the resulting probability values can be used not only to decide whether a
newly arriving document – in, for instance, a news filtering system – is relevant but
also to rank a set of not-yet seen documents.
- In collaborative filtering, however, the classifier is commonly used to determine the
membership of the active user in a cluster of users with similar preferences (by means
of a latent class variable),
- whereas in content-based recommendation the classifier can also be used directly to
determine the interestingness of a document.
- The core assumption of the naive Bayes model – that the individual events are
conditionally independent – does not hold in practice, because many term co-
occurrences are far more likely than others, such as the terms Hong and Kong
or New and York.
- Nonetheless, the Bayes classifier has been shown to lead to surprisingly good results
and is broadly used for text classification.
- “The paradox is explained by the fact that classification estimation is only a function
of the sign (in binary case) of the function estimation; the function approximation
can still be poor while classification accuracy remains high.”

[Paradox: “Men work together whether they work together or apart.” – Robert Frost]

- Besides the good accuracy that can be achieved with the naive Bayes classifier, a
further advantage of the method – and, in particular, of the conditional independence
assumption – is that
o the components of the classifier can be easily updated when new data are
available,
o the learning time complexity remains linear in the number of examples, and
o the prediction time is independent of the number of training examples.
- However, with most learning techniques, to provide reasonably precise
recommendations, a certain amount of training data (past ratings) is required.
- The “cold-start” problem also exists for content-based recommenders that require
some sort of relevance feedback.
- Possible ways of dealing with this are, for instance, to let the user manually label a set
of documents – although this cannot be done for hundreds of documents – or to ask
the user to provide a list of interesting words for each topic category.

- The Boolean representation of document features has the advantage of simplicity but,
of course, the possibly important information on how many times a term occurred in
the document is lost at this point.

- In the Syskill & Webert system, which relies on such a Boolean classifier for each
topic category, the relevance of words is taken into account only when the initial set
of appropriate keywords is determined.
- Afterward, the system cannot differentiate anymore whether a keyword appeared only
once or very often in the document.

- In addition, this model also assumes positional independence – that is, it does not take
into account where the term appeared in the document.
Classification example with term counts
DocID Words Label
1. recommender intelligent recommender 1
2. recommender recommender learning 1
3. recommender school 1
4. teacher homework recommender 0

5. recommender recommender recommender teacher homework ?

- Other probabilistic modeling approaches overcome such limitations. Consider, for
instance, the classification task in the above table, in which the number of term
appearances shall also be taken into account.
- The conditional probability of a term vi appearing in a document of class C shall be
estimated by the relative frequency of vi in all documents of this class:

  P(vi | C = c) = CountTerms(vi, docs(c)) / AllTerms(docs(c))
- where CountTerms(vi, docs(c)) returns the number of appearances of term vi in
documents labeled with c, and AllTerms(docs(c)) returns the number of all terms in
these documents.
- To prevent zeros in the probabilities, Laplace (add-one) smoothing shall be applied in
the example:

  P(vi | C = c) = (CountTerms(vi, docs(c)) + 1) / (AllTerms(docs(c)) + |V|)
- where |V| is the number of different terms appearing in all documents (called the
“vocabulary”). We calculate the conditional probabilities for the relevant terms
appearing in the new document as follows: the total length of the documents classified
as “1” is 8, and the length of document 4 classified as “0” is 3. The size of the
vocabulary is |V| = 6:

  P(recommender | Label = 1) = (5 + 1)/(8 + 6) = 6/14
  P(teacher | Label = 1) = P(homework | Label = 1) = (0 + 1)/(8 + 6) = 1/14
  P(recommender | Label = 0) = P(teacher | Label = 0) = P(homework | Label = 0)
    = (1 + 1)/(3 + 6) = 2/9

- The prior probabilities of a document falling into class 1 or class 0 are ¾ and ¼,
respectively. The classifier would therefore calculate the posterior probabilities as

  P(Label = 1 | Doc 5) ∝ 3/4 × (6/14)³ × 1/14 × 1/14 ≈ 3.0 × 10⁻⁴
  P(Label = 0 | Doc 5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 1.4 × 10⁻⁴

- and therefore classify the unlabeled document as being relevant for the user.
- The classifier has taken the multiple evidence of the term “recommender” into
account appropriately.
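The multinomial calculation with Laplace smoothing can be sketched as follows (pure Python; the documents are those of the term-count table above):

```python
from collections import Counter

# Term-count training documents from the table above
train = [
    ("recommender intelligent recommender", 1),
    ("recommender recommender learning", 1),
    ("recommender school", 1),
    ("teacher homework recommender", 0),
]
new_doc = "recommender recommender recommender teacher homework"

# vocabulary over all training documents, |V| = 6
vocab = {w for text, _ in train for w in text.split()}

def posterior(text, label):
    """Unnormalized posterior: prior times smoothed term probabilities."""
    class_words = [w for t, lab in train if lab == label for w in t.split()]
    counts = Counter(class_words)          # CountTerms(vi, docs(c))
    total = len(class_words)               # AllTerms(docs(c))
    p = sum(1 for _, lab in train if lab == label) / len(train)  # prior
    for w in text.split():
        # Laplace (add-one) smoothing prevents zero factors
        p *= (counts[w] + 1) / (total + len(vocab))
    return p

print(posterior(new_doc, 1) > posterior(new_doc, 0))  # True: classify as relevant
```

The three occurrences of “recommender” each contribute a factor 6/14 for class 1 but only 2/9 for class 0, which is how the multiple evidence of that term tips the decision.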

- If only the Boolean representation had been used, the classifier would have rejected
the document, because two other terms that appear in the document (“homework”, “teacher”)
suggest that it is not relevant, as they also appear in the rejected document 4.

Relation to text classification:

- The problem of labeling a document as relevant or irrelevant in our document
recommendation scenarios can be seen as a special case of the broader and older
text classification (text categorization or topic spotting) problem, which consists of
assigning a document to one of a set of predefined classes.
- Applications of these methods can be found in information retrieval for solving
problems such as personal e-mail sorting, detection of spam pages, or sentiment
detection.
- Different techniques of “supervised learning”, such as the probabilistic one described
previously, have been proposed.
- The basis for all the learning techniques is a set of manually annotated training
documents and the assumption that the unclassified (new) documents are somehow
similar to the manually classified ones.
- When compared with the described “like/dislike” document recommendation
problem, general text classification problems are not limited to only two classes.
- Moreover, in some applications it is also desirable to assign one document to more
than one individual class.

- As noted earlier, probabilistic methods that are based on the naive Bayes assumption
have been shown to be particularly useful for text classification problems.
- The idea is that both the training documents and the still-unclassified documents are
generated by the same underlying probability distributions.
- Basically, two different ways of modeling the documents and their features have been
proposed: the multinomial model and the Bernoulli model.
- The main differences between these models are the “event model” and, accordingly,
how the probabilities are estimated from the training data.

- In the multivariate Bernoulli model, a document is treated as a binary vector that
describes whether a certain term is contained in the document.
- In the multinomial model the number of times a term occurred in a document is
also taken into account, as in our earlier example.
- In both cases, the position of the terms in the document is ignored.
- Empirical evaluations show that the multinomial model leads to significantly better
classification results than the Bernoulli model, in particular for longer documents
and classification settings with a larger number of features.

- Finally, another interesting finding in probabilistic text classification is that not only
can the manually labeled documents be used to train the classifier, but still-
unlabeled documents can also help to improve classification.
- In the context of content-based recommendation this can be of particular importance,
as the training set of manually or implicitly labeled documents is typically very small
because every user has his or her personal set of training examples.

Other linear classifiers and machine learning:


- When viewing the content-based recommendation problem as a classification
problem, various other machine learning techniques can be employed.

- At a more abstract level, most learning methods aim to find coefficients of a linear
model to discriminate between relevant and non-relevant documents.

- Figure sketches the basic idea in a simplified setting in which the available documents
are characterized by only two dimensions.
- If there are only two dimensions, the classifier can be represented by a line.
- The idea can, however, also easily be generalized to the multidimensional space in
which a two-class classifier then corresponds to a hyper-plane that represents the
decision boundary.
- In two-dimensional space, the line we search for has the form w1x1 + w2x2 = b,
where x1 and x2 correspond to the vector representation of a document (using, e.g.,
TF-IDF weights) and w1, w2, and b are the parameters to be learned.
- The classification of an individual document is based on checking whether for a
certain document w1x1 + w2x2 > b, which can be done very efficiently.
- In n-dimensional space, a generalized formulation using weight and feature vectors
instead of only two values is used: the decision boundary is the hyperplane wᵀx = b,
and a document x is classified as relevant when wᵀx > b.
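A minimal sketch of such a linear decision function; the weights and feature values here are purely hypothetical and assumed to have been learned by some training procedure:

```python
def classify(w, x, b):
    """Classify document vector x as relevant iff w^T x > b."""
    return sum(wi * xi for wi, xi in zip(w, x)) > b

# hypothetical two-dimensional example with TF-IDF-like feature values
w, b = [0.8, 0.5], 0.6
print(classify(w, [0.9, 0.4], b))  # 0.8*0.9 + 0.5*0.4 = 0.92 > 0.6 -> True
print(classify(w, [0.2, 0.1], b))  # 0.8*0.2 + 0.5*0.1 = 0.21 > 0.6 -> False
```

Note that the prediction cost is a single dot product, independent of the number of training examples, which is why linear classifiers scale well at recommendation time.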

- Many text classification algorithms are actually linear classifiers, and it can easily be
shown that both the naive Bayes classifier and the Rocchio method fall into this
category.
- Other methods for learning linear classifiers are, for instance, the Widrow-Hoff
algorithm or support vector machines.
- The k-nearest-neighbor (kNN) method, on the other hand, is not a linear classifier.
- In general, infinitely many hyperplanes exist that can be used to separate the
document space.
- The aforementioned learning methods will typically identify different hyperplanes,
which may in turn lead to differences in classification accuracy.
- In other words, although all classifiers may separate the training data perfectly, they
may show differences in their error rates for additional test data.
- Implementations based on SVMs, for instance, try to identify the decision boundary
that maximizes the distance (called the margin) to the existing data points, which
leads to very good classification accuracy when compared with other approaches.

- Another challenge when using a linear classifier is to deal with noise in the data.
There can be noisy features that mislead the classifier if they are included in the
document representation.
- In addition, there might also be noise documents that, for whatever reason, are not
near the cluster where they belong.

- The identification of such noise in the data is, however, not trivial.

- Although in comparative experiments some algorithms, in particular SVM-based
ones, performed better than others, there exists no strict guideline as to which
technique performs best in every situation.
- Moreover, it is not always clear whether using a linear classifier is the right choice at
all, as there are, of course, many problem settings in which the classification borders
cannot be reasonably approximated by a line or hyperplane.
- Overall, “selecting an appropriate learning method is therefore an unavoidable part of
solving a text classification problem”

Explicit decision models:

Two other learning techniques that have been used for building content-based
recommender systems are based on decision trees and rule induction. They differ from
the others insofar as they generate an explicit decision model in the training phase.
Decision tree learning based on ID3 or the later C4.5 algorithms has been successfully
applied to many practical problems, such as data mining problems.
When applied to the recommendation problem, the inner nodes of the tree are labeled
with item features (keywords), and these nodes are used to partition the test examples
based, for instance, simply on the existence or nonexistence of a keyword in the
document.
In a basic setting only two classes, interesting or not, might appear at the leaf nodes.
Figure depicts an example of such a decision tree.

- Determining whether a new document is relevant can be done very efficiently with
such a prebuilt classification tree, which can be automatically constructed (learned)
from training data without the need for formalizing domain knowledge.
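A hand-built sketch of such a tree; the keywords and the tree structure are purely illustrative, not taken from a real system:

```python
def tree_classify(doc_words):
    """Inner nodes test the existence of a keyword; leaves carry the class."""
    if "recommender" in doc_words:          # root node: keyword exists?
        if "homework" in doc_words:         # inner node on a second keyword
            return "not interesting"
        return "interesting"
    return "not interesting"

print(tree_classify({"recommender", "intelligent"}))  # interesting
print(tree_classify({"school", "homework"}))          # not interesting
```

In practice such trees are not written by hand but learned from training data by ID3/C4.5 using an information-gain splitting criterion.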

- Further general advantages of decision trees are that they are well understood, have
been successfully applied in many domains, and represent a model that can be
interpreted relatively easily.
- The main issue in the content-based recommendation problem setting is that we have
to work on relatively large feature sets using, for instance, a TF-IDF document
representation.

- Decision tree learners, however, work best when a relatively small number of features
exist, which would be the case if we do not use a TF-IDF representation of a
document but rather a list of “meta”- features such as author name, genre, and so
forth.
- An experimental evaluation actually shows that decision trees can lead to comparably
poor classification performance.
- The main reason for this limited performance on large feature sets lies in the typical
splitting strategy based on the information gain, which leads to a bias toward small
decision trees.
- For these reasons, decision trees are seldom used for classical content-based
recommendation scenarios.
- One of the few exceptions is the work of Kim et al., in which decision trees were used
for personalizing the set of advertisements appearing on a web page.
- Still, even though decision trees might not be used directly as the core
recommendation technique, they can be used in recommender systems in combination
with other techniques to improve recommendation efficiency or accuracy.
- Decision trees have been used to compress in-memory data structures for a
recommender system based on frequent itemset mining;
- decision trees have also been used to determine which user model features are the
most relevant ones for providing accurate recommendations in a content-based/
collaborative hybrid news recommender system.
- Thus, the learning task in this work is to improve the recommendation model itself.

Rule induction:
- Rule induction is a similar method that is used to extract decision rules from training
data.
- Methods built on the RIPPER algorithm have been applied with some success for e-
mail classification, which is, however, not a core application area of recommender
systems.
- As mentioned by Pazzani and Billsus, the relatively good performance when
compared with other classification methods can be partially explained by the
elaborate post-pruning techniques of RIPPER itself and a particular extension made
for e-mail classification that takes the specific document structure of e-mails, with
a subject line and a document body, into account.
- A more recent evaluation and comparison of e-mail classification techniques can be
found in Koprinska et al., which shows that “random forests” (instead of simple trees)
perform particularly well on this problem

- Both decision tree learning and rule induction have been successfully applied to
specific subproblems such as e-mail classification, advertisement personalization, or
cases in which small feature sets are used to describe the items, which is a common
situation in knowledge-based recommenders.
- In these settings, two of the main advantages of these learning techniques are that
o (a) the inferred decision rules can serve as a basis for generating explanations
for the system’s recommendations and
o (b) existing prior domain knowledge can be incorporated in the models.

On feature selection:
- All the techniques described so far rely on the vector representation of documents
and on TF-IDF weights.

- When used in a straightforward way, such document vectors tend to be very long
(there are typically thousands of words appearing in the corpus) and very sparse (in
every document only a fraction of the words is used), even if stop words are removed
and stemming is applied.
- In practical applications, such long and sparse vectors not only cause problems with
respect to performance and memory requirements, but also lead to an effect called
overfitting.
- Consider an example in which a very rare word appears by pure chance only in
documents that have been labeled as “hot”.
- In the training phase, a classifier could therefore be misled in the direction that this
word (which can, in fact, be seen as some sort of noise) is a good indicator of the
interestingness of some document.
- Such overfitting can easily appear when only a limited number of training documents
is available.

- Therefore, it is desirable to use only a subset of all the terms of the corpus for
classification.
- This process of choosing a subset of the available terms is called feature selection.
- Different strategies for deciding which features to use are possible.
- Feature selection in the Syskill & Webert recommender system mentioned earlier
(Pazzani and Billsus 1997), for instance, is based on domain knowledge and lexical
information from WordNet.
- Their evaluation shows not only that recommendation accuracy improves when
irrelevant features are removed, but also that using around 100 “informative”
features leads to the best results.

- Another option is to apply frequency-based feature selection and use domain- or task-
specific heuristics to remove words that are “too rare” or appear “too often” based on
empirically chosen thresholds.
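Such a frequency-based heuristic can be sketched as follows; the toy corpus and the document-frequency thresholds are invented for illustration:

```python
from collections import Counter

corpus = [
    "the recommender systems learn user profiles",
    "the user likes intelligent recommender systems",
    "the school teaches learning",
]
min_df, max_df = 2, 2  # empirically chosen: drop terms too rare or too common

# document frequency: in how many documents does each term occur?
df = Counter(w for doc in corpus for w in set(doc.split()))
selected = sorted(w for w, n in df.items() if min_df <= n <= max_df)
print(selected)  # ['recommender', 'systems', 'user'] - "the" (df = 3) is dropped
```

Terms appearing in almost every document (like “the”) carry little discriminative information, while terms appearing only once are likely noise; the thresholds trade these two effects off.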

- For larger text corpora, such heuristics may not be appropriate, however, and more
elaborate, statistics-based methods are typically employed.
- In theory, one could find the optimal feature subset by training a classifier on every
possible subset of features and evaluating its accuracy.
- Because such an approach is computationally infeasible, the value of individual
features (keywords) is rather evaluated independently and a ranked list of “good”
keywords, according to some utility function, is constructed.
- The typical measures for determining the utility of a keyword are the χ2 test, the
mutual information measure, or Fisher’s Discrimination Index.

χ² contingency table

                      Term t appeared   Term t missing
Class “relevant”             A                 B
Class “irrelevant”           C                 D
- Consider, for example, the χ2 test, which is a standard statistical method to check
whether two events are independent.
- The idea in the context of feature selection is to analyze, based on training data,
whether certain classification outcomes are connected to a specific term occurrence.
- When such a statistically significant dependency for a term can be identified, we
should include this term in the feature vector used for classification.

- In our problem setting, a 2 × 2 contingency table of classification outcomes and
occurrence of term t can be set up for every term as in Table when we assume a
binary document model in which the actual number of occurrences of a term in a
document is not relevant.
- The symbols A to D in the table can be taken directly from the training data:
- A stands for the number of documents that contained term t and were classified as
relevant, and
- B is the number of documents that were classified as relevant but did not contain the
term.
- Symmetrically, C and D count the documents that were classified as irrelevant.
- Based on these numbers, the χ2 test measures the deviation of the given counts from
those that we would statistically expect when conditional independence is given.
- The χ² value is calculated as follows:

  χ² = (A + B + C + D) × (AD − CB)² / ((A + B)(A + C)(B + D)(C + D))
- Higher values for χ2 indicate that the events of term occurrence and membership in a
class are not independent.
- To select features based on the χ2 test, the terms are first ranked by decreasing order
of their χ2 values.
- The logic behind that is that we want to include those features that help us to
determine class membership (or non-membership) first – that is, those for which class
membership and term occurrence are correlated.
- After sorting the terms, a number of experiments should be made to determine the
optimal number of features to use for the classifier.
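The ranking step can be sketched directly from the contingency-table formula; the counts (A, B, C, D) for the three candidate terms below are invented for illustration:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for the 2x2 table of term occurrence vs. class."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + B) * (A + C) * (B + D) * (C + D))

# hypothetical counts (A, B, C, D) for three candidate terms
terms = {
    "recommender": (40, 10, 5, 45),   # occurrence correlated with "relevant"
    "school":      (5, 45, 40, 10),   # occurrence correlated with "irrelevant"
    "the":         (25, 25, 25, 25),  # independent of the class
}

# rank terms by decreasing chi-square value
ranked = sorted(terms, key=lambda t: chi_square(*terms[t]), reverse=True)
print(ranked[-1])  # 'the' ranks last: chi-square = 0, term and class independent
```

Note that “school” scores just as high as “recommender”: a term that predicts non-membership is as useful for the classifier as one that predicts membership.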
- As mentioned previously, other techniques for feature selection, such as mutual
information or Fisher’s Discriminant, have also been proposed for use in information
retrieval scenarios.
- In many cases, however, these techniques result more or less in the same set of
keywords (maybe in different order) as long as different document lengths are taken
into account.

Limitations:
Pure content-based recommender systems have known limitations, which rather soon
led to the development of hybrid systems that combine the advantages of different
recommendation techniques.

Shallow content analysis:


- Particularly when web pages are the items to be recommended, capturing the quality
or interestingness of a web page by looking at the textual contents alone may not be
enough. Other aspects, such as aesthetics, usability, timeliness, or correctness of
hyperlinks, also determine the quality of a page.
- When keywords are used to characterize documents, a recommender cannot
differentiate between well-written articles and comparably poor papers that, naturally,
use the same set of keywords.
- Furthermore, in some application domains the text items to be recommended may not
be long enough to extract a good set of discriminating features.
- A typical example is the recommendation of jokes.

- Learning a good preference profile from a very small set of features may be difficult
by itself; at the same time it is nearly impossible to distinguish, for instance, good
lawyer jokes from bad ones.

- Information in hypertext documents is also increasingly contained in multimedia
elements, such as images, as well as audio and video sequences. These contents are
likewise not taken into account when only a shallow text analysis is done.
- Although some recent advances have been made in the area of feature extraction from
text documents, research in the extraction of features from multimedia content is still
at an early stage.

Overspecialization:
- Learning-based methods quickly tend to propose more of the same – that is, such
recommenders can propose only items that are somehow similar to the ones the
current user has already (positively) rated.
- This can lead to the undesirable effect that obvious recommendations are made and
the system, for instance, recommends items that are too similar to those the user
already knows.
- A typical example is a news filtering recommender that proposes a newspaper article
that covers the same story that the user has already seen in another context.
- One remedy is therefore to define a threshold that filters out not only items that are
too different from the profile but also those that are too similar.
- A set of more elaborate metrics for measuring novelty and redundancy has been
analyzed.
- A general goal therefore is to increase the serendipity of the recommendation lists –
that is, to include “unexpected” items in which the user might be interested, because
expected items are of little value for the user.
- A simple way of avoiding monotonous lists is to “inject a note of randomness”.

Acquiring ratings:
- The cold-start problem, which we discussed for collaborative systems, also exists in a
slightly different form for content-based recommendation methods.
- Although content-based techniques do not require a large user community, they
require at least an initial set of ratings from the user, typically a set of explicit “like”
and “dislike” statements.
- In all described filtering techniques, recommendation accuracy improves with the
number of ratings; reported experiments show significant performance increases for
the learning algorithms when the number of ratings was between twenty and fifty.
- However, in many domains, users might not be willing to rate so many items before
the recommender service can be used.
- In the initial phase, it could be an option to ask the user to provide a list of keywords,
either by selecting from a list of topics or by entering free-text input.

- Again, in the context of Web 2.0, it might be an option to “reuse” information that the
user may have provided or that was collected in the context of another personalized
(web) application and take such information as a starting point to incrementally
improve the user profile.
